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Conditional Vertical Subsampling— 
A Technique to Assist in the 
Coding of Television Signals 


By R. F. W. PEASE 
(Manuscript received July 23, 1971) 


In an interlaced scan television system, the vertical sampling rate of 
an image can be halved by sampling every other field. Picture elements in 
the missing fields are replaced in the display by both temporal and vertical 
interpolation, but the resulting pictures show some visible defects. This 
paper describes how these defects can be eliminated at the extra cost of fully 
sampling in selected areas of the picture. For a typical Picturephone~ 
scene with active movement the selected areas make up about 6 percent 
of the picture elements in the unsampled field. The technique can be combined 
with a wide variety of interframe-coding techniques. In one particular 
example in which the television signal ts specified as clusters of frame-to- 
frame differences, the cost of specifying ‘active’ frames (14,000 significant 
frame differences per frame) 1s reduced from 68,000 bits to 42,500 bits. 
This corresponds to a reduction in bit rate from 2 Mbits sec™* to 1.38 Mbits 
sec. 


I. INTRODUCTION 


The technique of reducing the horizontal-sampling frequency (‘‘sub- 
sampling”’) in the moving parts of the television image has been pre- 
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viously described.’ It was found that the frequency could be halved 
without visible degradation for most object speeds. For slow speeds 
the degradation was visible but not objectionable. Combining this 
technique with that of conditional replenishment” has proved particu- 
larly effective,’ because in periods of fast movement, conditional 
replenishment by itself becomes uneconomic because so many picture 
elements (pels) change significantly from frame to frame. 

An obvious question to ask is whether a similar advantage results 
from subsampling vertically. 

In an interlaced television scan format, the idea of halving the vertical- 
sampling frequency can be confusing because we must distinguish 
between fields and frames. A diagram showing the vertical position of 
lines for successive fields is shown in Fig. 1. One method of halving 
the vertical-sampling frequency is to sample every second line in 
each field and to replace the unsampled lines by interpolating the values 
of vertically adjacent elements in the same field [e.g., an unsampled 
line with coordinates y = 2 and t = 2 is replaced by an average of 
lines with y, ¢ coordinates (0, 2) and (4, 2)]. This has been tried and 
the degradation is subjectively objectionable. 

The second method is to sample alternate fields so that in stationary 
pictures the vertical-sampling frequency is halved. The unsampled 
fields are replaced by an average of the four nearest neighbors in Fig. 1 
(e.g., an unsampled line with coordinates y = 2 and t = 2 is replaced 
by an average of lines with y, ¢ coordinates 1 , 1; 1, 3;3,1;and3, 8). 
With this method the resulting pictures are subjectively satisfactory 
except for fast moving contrasty edges which appear blurred and 
somewhat jerky, and in dark areas with little horizontal but much 
vertical detail;* in these areas an aliasing pattern is sometimes visible. 


t (FIELD NO.) 
0 2 4 6 


Y (LINE NO.) 





Fig. 1—Temporal position (or Field No., ¢) and vertical position (or Line No., y) 
of lines in a 2:1 interlace scan. 
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Henceforth, we will use the term “vertical subsampling’” to refer to 
this second method. 

If we vertically subsample over most of the moving regions except 
where there are visible defects, the entire picture should be subjectively 
satisfactory. We call this technique, conditional vertical subsampling 
(CVSS). The question is: How much extra channel capacity is needed 
to bring about the required improvement? This paper describes some 
experiments designed to answer this question. 

Because the extra information is generated at an irregular rate, the 
use of conditional vertical subsampling requires a buffer memory and 
hence is probably most useful in conjunction with a buffered coder. 
In this paper we have in mind the eventual use of conditional vertical 
subsampling in a buffered interframe coder similar to those described 
in Refs. 2 and 3. 


II. EXPERIMENTAL ARRANGEMENT 


The basic apparatus is shown in Fig. 2. The output of the television 
camera (a silicon diode array vidicon) is sampled at 2.02 MHz and is 
digitized to 8-bit accuracy. The scan format is similar to that used in 
the Picturephone service in that there are 271 lines per frame and 60 
fields per second: Each frame is made up of two interlaced fields. 

Consider first only the odd-numbered fields. The coded version of a 
previous odd field appears at the output of field delay 1 and is compared, 
picture element by picture element, with the digitized input to determine 
whether the stored value is an adequate representation of the current 
value. For any pel, if the difference between the input signal and the 
stored signal exceeds a value (7'1) of four levels out of 255, then ¢,; = 1 
and S1 is switched to the ‘‘one”’ position so that the value of the input 
replaces the previous value in the field delay. This new value is circulated 
twice in the field delay (because S1 is held at 0 during even fields) and 
after exactly one frame time is compared with the new input signal. 
Usually, in the absence of movement, the comparison is good enough 
so that no further updating need take place (i.e., no new information 
need be sent to a hypothetical receiver). Thus, each odd field is condi- 
tionally replenished as described in Ref. 2. 

When the replenished version of line 3 , 3 (1.e., line No. 8, field No. 3) 
emerges first from switch Sl, the replenished version of line 1, 3 is 
appearing at the output of the line delay. Similarly, the replenished 
version of line 3, 1 is appearing at the output of field delay 1 and the 
line 1, 1 is appearing at the output of the second half-line delay. Thus, 


“Nn 
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SUBTRACTOR 1 
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8 BIT PCM 
VIDEO SIGNAL 






FIELD DELAY + 
1/2 LINE DELAY 






WAVEFRONT V ADDER 
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ete (eh Ss 
Sen ee ae ae ous ae 
DIGITAL TO 
| 33.3 MSEC A (A+B)/2 ANALOG. CONVERTER 
16.7 MSEC AND DISPLAY 


Fig. 2—Basic diagram of apparatus used to evaluate conditional vertical subsam- 
pling. Odd fields are coded by conditional (frame) replenishment using the loop 
formed by switch S1 and field delay 1; the output for these fields appears at 0 on 
switch $3. The output for even fields appears at 1 on switch $3 and is derived either 
from the vertically subsampled signal at the output of adder A3 or from the original 
output delayed by 1 field and 1/2 line. 


lines 3,3; 1,3; 3,1; and 1,1 are simultaneously available and so, 
using adders Al, A2, and A8, we can form the spatially and temporally 
interpolated value, which as described in the previous section, is dis- 
played instead of line 2, 2 when there is subsampling. 

To maintain the correct temporal and spatial sequence of displayed 
fields, the replenished version of line 3, 3 is not displayed until one 
field time plus one-half line time after first appearing at switch S1. 
Thus, line 3, 3 and other lines in odd fields are taken from the output of 
the first one-half line delay, and if we wish to vertically subsample 
continuously, the output is taken from adder A3 for even fields (i.e., 
82 is kept in the 1 position and $3 is switched by waveform V). 

However, we want to investigate how much the picture quality is 
improved by replacing the interpolated value with the original signal 
in selected parts of the even fields. To do this in the proper time sequence, 
the original signal is delayed by one field time plus one-half line time 
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and then compared with the interpolated value in subtractor 2. When 
the magnitude of the difference equals or exceeds the threshold T2, 
then t, = 1 and the original is displayed. Otherwise the interpolated 
value is displayed. In a practical coder the position and amplitude of 
those parts in the original signal used to replace the interpolated value 
must be coded and transmitted to the receiver. It is the cost of trans- 
mitting this extra information which must be balanced against the 
resulting improvement in picture quality. 

The output of switch 83, which corresponds to the output of the 
hypothetical receiver, is converted to an analog signal and displayed 
on a television monitor with a 5-1/2 inch by 5-inch raster and a polarizing 
faceplate. The picture was viewed at 36 inches in a room with average 
illumination for an office (70 foot-candles). 

Three series of experiments were carried out. In the first series, the 
subjective effect of varying the threshold 72 was investigated. In the 
second series, we measured the frequency of occurrence of picture 
elements for which the difference between the interpolated value and 
the delayed signal equals or exceeds T2 (henceforth we shall refer to 
such events as VSS differences). In the third series, we investigated 
the subjective and numerical effects of grouping the VSS differences 
into clusters so that a separate address word need not be used for 
each VSS difference. For most experiments requiring numerical results, 
the scene was the swinging model head shown in Fig. 3. The period of 
the swing was 2.7 seconds and the amplitude corresponded to 44 picture 
elements. The maximum speed of the head corresponded to 3-1/2 pels 
per frame interval. From time to time, some experiments were also 
carried out with live subjects or with very contrasty material. 


III. RESULTS AND DISCUSSIONS 


3.1 Subjective Effects of Varying the Threshold T2 


The results of the two extremes is already known; for T2 = 1, we 
have the original picture and for T2 = 255, we have the vertically 
subsampled picture with the aforementioned defects. With a threshold 
of T2 = 4 (out of 255 levels), close scrutiny (15 inches) of the hair of the 
model when stationary revealed a just detectable difference from 
the original only if the original was compared instantaneously; otherwise 
there was no difference between the two pictures. With T2 = 8, there 
was a slight loss in vertical detail in the hair of the model which again 
was only noticeable at the normal viewing distance by switching 
instantaneously to the original. In some areas containing a lot of dark 
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Fig. 83—Swinging model head scene. The head is swung with a peak amplitude 
shown by the distance between adjacent vertical pencils. 


vertical detail but little horizontal detail, the low contrast aliasing 
pattern could be made out. For 72 = 12, there was a more visible loss 
of vertical detail of stationary pictures and fast movement (4 pels per 
frame interval and above) of contrasty edges (64 levels per pel) showed 
some smearing or occasional raggedness. For T2 = 16, the above effects 
were more pronounced. As the threshold was increased beyond 16, 
the above effects became progressively more serious. For T'2 = 32, the 
picture quality was virtually the same as for T = 2565. 


3.2 Frequency of Vertical Subsampling Differences For Even Fields 


In even fields we counted the number of VSS differences for various 
values of 72. The numbers were recorded on a digital printer with a 
sampling frequency of about 5 Hz; the scene was the swinging model 
head. Simultaneous counts were also made of the frame differences 
exceeding a threshold of 4 (out of 255 levels) occurring in the odd fields. 
This count served as a check for the repeatability of the scene and 
also gave a measure of the amount of activity in the scene. 

Some of the results are shown in Fig. 4. The dotted lines indicate 
the frequency of frame differences and show the characteristic periodic 
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pattern as the head swings. The variability in the depth of the trough 
is probably due to the relatively low sampling frequency. The maximum 
activity generates more than 7000 significant frame differences in the 
odd field which, in comparison with other Picturephone scenes is con- 
sidered ‘‘active.”* The corresponding curves for the vertical sub- 
sampling differences show the same form of activity but are much less 
variable. For instance, for 72 = 8 the variation is from 1100 per field 
at the end of the swing to 1900 per field at the bottom of the swing. 
This is understandable as those vertical subsampling differences which 
arise from the spatial interpolation tend to remain unchanged as the 
movement in the picture decreases. 


Ts: THRESHOLD BETWEEN INPUT SIGNAL AND 
INTERPOLATED SIGNAL 


T,; : THRESHOLD BETWEEN INPUT SIGNAL AND STORED 
SIGNAL (DELAYED BY ONE FRAME) 


NUMBER PER FIELD OF 
SIGNIFICANT DIFFERENCES (+ 1000) 


EXPANDED SCALE 





ce] 1 2 3 4 5 6 7 8 
TIME IN SECONDS 


Fig. 4—Frequency of significant differences as a function of the scene (swinging 
model head). 
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Note the big decrease in counts of VSS differences as the threshold 
increases. For T2 = 8, which for a scene with movement gives a picture 
virtually indistinguishable from the fully sampled picture, there are 
only about 2000 significant VSS differences compared with more than 
7000 significant frame differences. For 72 = 16, which still gives a 
substantial improvement in picture quality over that with T2 = 255, 
the frequency of significant VSS differences is less than 500 which is 
negligible compared with the frequency of significant frame differences. 

If the threshold 71 (defining a significant frame difference) is raised 
to 8, then the picture quality is visibly degraded due to the ‘‘dirty 
window effect.’’” One other effect is to reduce the frequency of significant 
frame differences from about 7000 per field to about 5000 per field when 
the model head is at the bottom of the swing (Fig. 5). A third effect 
of increasing 7'1 from 4 to 8 is to increase the counts of significant VSS 
differences when T2 = 8 from about 2000 to 2600 per field. This increase 
can be seen by comparing the relevant graphs in Figs. 4 and 5. The 
increase is not noticeable for values of T2 greater than 10. This third 
effect can be easily explained by the increased difference between the 
respective odd fields and the original signal being added to the difference 
introduced by interpolation. Note however, that as both thresholds 
increase from 4 to 8, the counts of VSS differences decreased more than 
the corresponding counts of significant frame differences in spite of the 
fact that the subjective degradation of increasing 71 is more severe 
than that due to increasing 72. 


3.3 Cost of Transmission 


One way to transmit the extra information would be to use 8 bits 
to specify the horizontal position and another 8 bits to specify the 
revised amplitudes of each picture element showing a significant VSS 
difference. However, this is probably wasteful because: (2) the significant 
VSS differences tend to occur in clusters so that usually one address 
word can serve several VSS differences; the cost of this method is that 
either the length of the cluster or the end of the cluster must also be 
sent to the receiver. (77) 8 bits are probably unnecessary for the amplitude 
information as the extra information can probably be adequately 
specified as a relatively coarsely quantized difference signal. 


3.3.1 Effect of Cluster Coding 


Direct observation of the spatial distribution of significant VSS 
differences (Fig. 6) shows that they, like frame differences, tend to 
occur in clusters and so cluster coding is probably advantageous. 
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To accentuate this clustering and to reduce further the number of 
bits required for addressing, it is usually advantageous to run two 
closely separated clusters into one larger cluster.’ This is particularly 
important because the VSS differences due to temporal interpolation 
near a horizontally moving edge occur as two separate clusters (with 
differences of opposite sign) on either side of the edge but at the edge 
itself the differences are usually less than significant. This effect can be 
seen in Fig. 6 around the edges of the cheek of the model head; the 
reader may wish to verify exactly how this comes about by drawing 
out the video signal corresponding to a moving step on a target of a 
storage type television camera for three successive fields or by referring 
to the Appendix. 

It also follows that isolated significant VSS differences are relatively 


NUMBER PER FIELD OF SIGNIFICANT DIFFERENCES (+ 1000) 





TIME IN SECONDS 


Fig. 5—Frequency of significant differences as a function of time. Scene is a swing- 
ing model head. 
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Fig. 6—F lags representing VSS differences for 72 = 8 while the head is swinging. 


expensive to transmit, and they should not be transmitted if such rejec- 
tion does not cause subjective degradation of the picture.* We investi- 
gated this effect by leaving uncorrected single VSS differences with no 
neighboring VSS differences within 2 pels horizontally. The subjective 
effect was negligible except for values of T2 of 12 and above, when 
occasionally the loss in vertical detail could be made out or a moving 
edge appeared somewhat more ragged. To control switch S2 in Fig. 2, 
the subtractor and threshold circuit were followed by logic to reject 
isolated VSS errors and to run together clusters of VSS differences 
separated by one or two pels. We also rejected VSS differences in the 
first and last lines of even fields because vertical interpolation here 
will be very frequent but rejecting these values is subjectively negligible. 
With these restrictions, we measured the frequency of clusters and 
of pels contained in clusters. Some of the results are summarized in 
Table I and indicate that the average length of clusters remained at 
about 5 irrespective of picture motion. 

For comparison, equivalent data are shown for frame differences 
(using the same logic in recording the counts) and show that when the 
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scene is active the cluster length is longer for frame differences. These 
data correspond very well with equivalent data reported for scenes with 
live subjects.° 


3.3.2 The Effect of Quantizing the Signal 


Amplitude information in television signals is usually most econom- 
ically specified by quantizing a prediction error. For example, the 
amplitude of the previously scanned pel is often used as a prediction 
of the current pel; the difference is then quantized and coded for trans- 
mission. This technique could be used here but we decided instead to 
quantize the prediction error generated by the interpolated value and 
the input. Such errors tend to be smaller than either element-to-element 
or frame-to-frame differences. In the Appendix, we show how this comes 
about in one particular case. 

In Fig. 7 we have replotted some of the data of Fig. 4 and some extra 
data to show the frequency distribution of amplitudes of VSS differences 
for the swinging head. The curve is more peaked than comparable 
curves of element differences or frame differences and shows very few 
differences of amplitude greater then 10 percent of the peak amplitude. 
Thus quantizing the VSS difference directly is one attractive approach 
and should cost no more than 8 or 4 bits per pel if fixed length codes 
are used and perhaps much less if variable lengths are used. 

As a first step, we divided the magnitude of the difference amplitude 
scale up into seven classes as shown in Table II and then assigned 
weights as shown to each class. The pictures that resulted from cor- 
recting the VSS errors with these quantized signals showed little 
difference from those in which the VSS errors are corrected with an 
8-bit signal when the scene of the swinging head and the threshold 8 was 
used. Thus, a fixed code length of 3 bits would be sufficient to describe 
the amplitude information for each pel and still allow one word to 
denote the end of a cluster. However, when the scene contained a very 


TaBLE IJ—FREQUENCY OF FRAME DIFFERENCES, VSS DIFFERENCES, 
AND CLUSTERS FOR DIFFERENT THRESHOLDS FOR THE 
Swineine Heap Scene (Fig. 2) 





End of Swing Bottom of Swing 
Type of Difference] Frame | VSS | VSS | VSS | Frame | VSS | VSS | VSS 
Threshold Tl =4 8 12 16 4 8 12 16 
No. of Differences| 123 | 750 | 350 | 280 | 7000 {1400 | 650 | 400 Per Field 
No. of Clusters 47 125 60 35 440 260 | 140 80 Per Field 
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0 4 8 {2 1% 20 24 28 32 36 
AMPLITUDE (MAXIMUM VALUE 255 LEVELS) 


Fig. 7—Histogram of number of VSS differences as a function of amplitude values 
of 4 and above. 


contrasty (50 percent of peak amplitude) edge moving at 4 pels per 
frame interval or faster, there was visible degradation of the edge. There- 
fore, a 15-level quantizer with the decision levels and weights shown 
in Table III was used so that a 4-bit vocabulary would suffice for the 
amplitude information and one word would be left to denote the end 
of a cluster. Resulting pictures were indistinguishable from those using 
an 8-bit correction signal. 

If for each cluster we allow 4 bits per pel, 8 bits for the horizontal 
address, and 4 bits to denote the end of run, then we can plot the data 
which was summarized in Table I to show the total number of bits 
per field (excluding synchronizing bits) needed to specify the extra 


TaBLE []—7-LEvEL QUANTIZING SCALE 





Decision Level 0 +8 +15 +25 
out of 255 
Representative 
Weight 0 +10 +19 +28 





CONDITIONAL VERTICAL SUBSAMPLING 799 


TasBLE [JJ—15-LEvEL QUANTIZING SCALE 


Decision Level O +3 +6 +10 +15 +23 +33 +44 


; out of 255 
Representative 


Weight QO +44 +8 +12 +18 +28 +38 +50 


information for correcting significant VSS differences (Fig. 8). For 
T2 = 8, which gives very satisfactory correction of the VSS differences, 
and for a frame with 14,000 significant frame differences, the even field 
can be sent at a cost of only 8500 bits. For 7'2 = 12 the cost comes down 
to 4500 bits and for 72 = 16 there is a further drop to 1600 bits. 

For comparison, the cost, allowing 4 bits per frame difference and 
12 bits per cluster of transmitting the frame differences, is also shown. 


35 


30 (a) BITS NEEDED TO TRANSMIT 
EACH ODD FIELD 


(b) BITS NEEDED FOR EVEN 
FIELD T> = 6 

(C) BITS NEEDED FOR EVEN 
FIELD Ta =12 


25 


20 


NUMBER BITS PER FIELD (x 103) 





O-NWAYW 


c@) { 2 3 4 5 6 7 8 
TIME IN SECONDS 


Fig. 8—Number of bits required to transmit each odd field by conditional replenish- 
ment and each even field by conditional vertical subsampling. For two values of T2. 
Scene: Swinging model head. In each field the pels requiring transmission are 
grouped in clusters by first rejecting isolated changes and then running together 
clusters separated by 1 or 2 pels. Twelve bits are assigned to each cluster and 4 bits to 
each pel containing a cluster. Synchronizing bits have been excluded. 
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Thus even though the frame differences occur in longer clusters in 
active frames, the cost of transmitting the odd field is much greater 
than the cost of transmitting the even field. 


Iv. CONCLUSIONS 


For the scene used we can now answer the original question of com- 
paring the advantages of horizontal subsampling and vertical sub- 
sampling by comparing the total cost of transmitting one frame (exclud- 
ing synchronizing bits and “forced replenishment’”’’® bits). For vertical 
subsampling we need, for an active frame, 34,000 bits for the odd field 
and 8,500 bits in the even to give a subjectively pleasing picture; this 
gives a total of 42,500 bits. When the moving parts of the picture are 
horizontally subsampled, the number of bits used to specify amplitude 
is halved but the number required for addressing is unchanged. For 
two fields, therefore, there are 28,000 bits for the amplitude information 
and 12,000 bits for the address information to give a total of 40,000 bits. 
Thus, the numerical advantage of the two techniques is very similar. 

It should be pointed out that in the experiments described the vertical 
subsampling was applied to both the moving and stationary part of the 
picture. With a buffered interframe coder, the main problem is created 
by the moving parts of the picture. If the vertical subsampling is applied 
only to the moving part then the number of bits needed during the 
even fields is reduced still further. However, as can be seen from Fig. 6, 
there are relatively few VSS differences in the background and so in this 
case the savings would be marginal. The maximum savings so gained 
can be estimated by assuming that the only VSS differences that need 
correcting are those due to movement. The number of such differences 
can be estimated by assuming that they account for the difference 
between the number of VSS differences occurring for stationary scenes 
and for moving scenes. Thus in our experiment and for T2 = 8, the 
number of bits in the even field is approximately halved, and the total 
number of bits required to code the frame is reduced from 42,500 
to 38,000. 

One advantage of the technique of conditional vertical subsampling 
is that it can be combined with a wide variety of interframe-coding 
techniques, because the odd fields are not affected. One exception is 
that it is no longer straightforward to express amplitude information 
in the odd fields as field-to-field differences.” Otherwise the amplitude 
information in the odd fields can be expressed as element-to-element 
differences,” frame differences,’ or even two-dimensional spatial dif- 
ferences.’ How well the techniques of horizontal and vertical sub- 
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sampling can be simultaneously applied remains to be investigated. 
Horizontal subsampling in the odd field will certainly introduce extra 
VSS differences, but the numerical effect of these can probably be 
minimized by horizontally subsampling in the even fields as well. 
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APPENDIX 


Vertical Subsampling Differences Arising From the Horizontal Movement 
of a Vertical Edge 


Consider a vertical edge, consisting of a black to white transition of 
128 levels, being moved horizontally across the field of view of a tele- 
vision camera at a speed of p pels per frame interval when referred to 
the target of the camera. Assume that the video-signal level for each 
pel is a direct measure of the quantity of light which has fallen on the 
appropriate area of the target for 1/30 second just prior to being scanned. 
If the target is stationary, then a plot of signal level as a function of 
horizontal position is shown by the line ABCD in Fig. 9. If immediately 
after pel 0 is scanned, the edge moves horizontally to the right at 
8 pels per frame interval, then in the next frame the signal level will be 
a measure of the amount of time during the intervening 1/30th of a 
second that each point on the target received light from the bright side 
of the edge. For uniform motion therefore, the line ABED represents 
the signal level in the second frame. In the third frame the signal level 
is represented by the line AHD which is simply a translation of ABED 
by 8 pels. The signal level for the adjacent lines in the interlaced field 
is shown by the line AFGD and is simply a translation of ABED by 
4 pels (there being no vertical difference). When vertically subsampling 
however, we substitute for the interlaced field, an interpolated signal 
ABD and the VSS differences are represented by the vertical distances 
between lines BD and BFGD. Note that at the halfway point this 
difference is zero and on either side of the edge we have relatively large 
differences with a maximum value of 32 levels. This is in agreement 
with the ‘flags’? shown in Fig. 6, which occur in pairs of clusters strad- 
dling the edge of the moving cheeks. 

We can also use Fig. 9 to compare certain differences in level which 
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Fig. 9—Signal levels as a function of horizontal position for different fields. 


may be quantized or coded for transmission. First of all, element-to- 
element differences: For a stationary position we have a maximum step 
height of 128 levels which decreases as the speed increases. 

Frame-to-frame differences for a steadily moving edge are represented 
by the vertical distances between lines ABED and lines ABHD. The 
differences here have values up to 128 levels (a more accurate analysis 
shows that this level is only approached for values of p > 1). 

Field-to-field differences are represented by the vertical distance 
between ABED and AFGD and can take values up to 64 levels. 

Thus, VSS differences have, for the model, the smallest maximum 
value and would seem to be most economical to quantize and code. 
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Nonlinear Codes 
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In recent years a number of nonlinear codes have been discovered which 
have better error-correcting capabilities than any known linear codes. 
However, very little ts known about the properties of such codes. In this 
paper we study the most basic property, the weight enumerator. The weight 
of a codeword is the number of its nonzero components; the weight enum- 
erator gives the number of codewords of each weight, and is fundamental 
for obtaining the error probability when the code is used for error-correction 
on a noisy channel. In 1963 one of us showed that the weight enumerator 
of a linear code 1s related in a simple way to that of the dual code (Jessie 
MacWilliams, ‘A Theorem on the Distribution of Weights in a Systematic 
Code,” Bell System Technical Journal, 42, No. 1 (January 1968), 
pp. 79-94). In the present paper, which ts a sequel, we show that the same 
relationship holds for the weight enumerator of a nonlinear code. Further- 
more, a definition is given for the dual @* of a nonlinear binary code @ 
which satisfies (@*)* = @ provided @ contains the zero codeword. 


I. INTRODUCTION 


In recent years a number of nonlinear codes have been discovered 
which have better error-correcting capabilities than any known linear 
codes (e.g., Refs. 1 and 2). However, very little is known about the 
properties of such codes. In this paper we study the most basic property, 
the Hamming weight enumerator (defined in Section IT), which gives 
fundamental information about the error probability when the code is 
used in various error-correction schemes (Ref. 3, Ch. 16). In 1963 one 
of us showed that the Hamming and the complete weight enumerators 
of a linear code are related in a simple way to those of the dual code 
(Ref. 4; Theorems 1 and 3 below). The requirement that the code be 
linear is unsatisfactory for two reasons: (7) Several pairs of nonlinear 
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codes @, ® are known whose weight enumerators satisfy Theorem 3. 
One example of such a pair is given by the Preparata” and Kerdock' 
codes, another by the code shown in Fig. 1. (22) The important theorem 
of S. P. Lloyd (giving a necessary condition for the existence of a 
prefect code) may be deduced for linear codes as a corollary to Theorem 3 
(Ref. 4, Lemma 2.15), but may be proved directly without assuming 
linearity (Ref. 5; Ref. 6, p. 111). 

It is the purpose of the present paper, therefore, to define the ‘‘weight 
enumerators of the dual code” so as to make Theorems 1 and 3 (and the 
corresponding theorem for the Lee weight enumerator, Theorem 2) 
valid even for nonlinear codes. 

Furthermore, if @ is a nonlinear binary code which contains the zero 
codeword, we define the formal dual @* so as to satisfy: 


(i) (@*)” = @, 
(ii) if @ is linear the two definitions of @” agree. 


The paper is arranged as follows. Section II states the three Mac- 
Williams identities (Theorems 1, 2, 3). Section III treats the binary 
case, when the three theorems coincide. The formal dual of a nonlinear 
binary code is defined in Section 3.5. Section IV treats the general case, 
first proving Theorem 1 and then deducing Theorems 2, 3 from it. 
In Section V we discuss properties of the “weights of the dual code”’ 
B(i). However, the problem of finding conditions for the B(i) to be 
positive integers remains unsolved. 


O: OF 20>. SOF 0" ik ~ 20. be oe se Od. le ea od 
bh. Or. 0, Oo O° Os. GO O. 2G. ay od i a od 
dpe Os ae BD NOE. Be 20> 2:0 Ooo 9-2. oh -d dk. a 
de, SQ» Qi» oat ~0F, 30: "O° 30 Or de oe “Oe abe de. he dk 
Be oP 20) 0 dk OP Be 26 BO? Shr a -a -Or, ae a et 
ae 408 -20:. 20), “Oh. -e ©O:- 70 O’. cd; ae <i ay -@ AY ed 
2 A AOS, BOG 7 AY. HO) “ds 20 Os. is se, Cl. SES ed 90 el 
de 0 OQ Ore Oe. OE Oe) VL Ode at a, tae a 100) 


Fig. 1—The sixteen rows form a nonlinear code @. 
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II. WEIGHT ENUMERATORS 


Let F be a finite field GF(q), where q is a prime power; and let F” be 
a vector space of dimension n over F’. A linear code @ of length n over 
GF(q) is a subspace of F”, and @* denotes the orthogonal subspace or 
dual code of @. A code is self-dual if @ = @*. A nonlinear code is any 
subset of F”. In this paper a code is linear unless stated otherwise. 

We propose to describe the code vectors of a code @ in three ways, 
giving progressively less information (but becoming progressively 
easier to handle). 


2.1 The Complete Weight Enumerator 


Let the elements of F be w) = 0, ;, w2, *** , Wg-1, In some fixed 
order. The composition of a vector v « F” is defined to be 


comp (v) = 5 (So »Sry,%t" y Sia); (1) 


where s; = s;(v) is the number of coordinates of v equal to w; . Clearly 

a-1 

can Se = Mt, 

Let A(t) be the number of vectors v in @ with comp (v) = t. The set 
of integers {A(t)} is the complete weight enumerator of @. 

The first MacWilliams identity relates the complete weight enumera- 
tors of @ and @*. (Ref. 4, Lemma 2.7. See also Refs. 7 and 8.) 


Theorem 1: If @ ts a linear code with complete weight enumerator 
{A(t)}, and its dual code Q* has complete weight enumerator {B(t)}, then 


? ee 1 aq-1 a-l tl 
> B(s)ee’ +++ 253° = 77 2s A(t) Il 6 sews) (2) 
S i@| t 1=0 \7=0 
where the z; are indeterminates and X ts a character on GF (q) (defined in 
Section 4.2). 


2.2 The Lee Weight numerator 


For q = 2 this description coincides with the preceding, and for 
q = 2°, s > 1 it is not defined; so in this section q is assumed to be an 
odd prime power. 

For q prime, we wish to classify the coordinates of the code vectors 
by magnitude. For example, codewords over GF(5) = {0, 1, —1, 2, —2} 
would be classified according to the number of components which are 0, 
the number which are +1, and the number which are +2 (but without 
regard to the actual number which are 1, —1, 2, or —2). 

In general, for g a prime power, let the elements of F be w, = 0, 
Oy tt Wey Wig > W-gr15 °° * y Wai , where w-; = 70; and 6 = a(q Tt 1). 
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Then the Lee weight of a vector ve F” is defined to be 
Lee (v) = (lang ty: 3 bcos , Us), 


where J; = 1;(v) is the number of coordinates of v equal to either w; 
or —w,; . In the notation of eq. (1), 


In(v) = so(v) 
L(v) = sv) + s_v) for «t= 1,---, 6. 


Let A”(t) be the number of vectors v in @ with Lee (v) = t; so that 
{A”(t)} is the Lee weight enumerator of @. 

The second MacWilliams identity relates the Lee weight enumerators 
of @ and @’: 


(3) 


Theorem 2: 
Dd B’(s)es’ «++ 25 


= Ta > » A*(t) I (: + > (X(wjw1) + X(—w; wy) ; (4) 


where {B"(s)} is the Lee weight enumerator for @*. 


(Theorem 2 is believed to be new.) The Lee enumerator is important 
both because it is an appropriate measure for codes to be used in phase- 
modulation communication schemes (see Lee, Ref. 9; Berlekamp, 
Ref. 3, p. 205) and as a compromise in giving much more information 
than the Hamming enumerator, yet requiring only half as many vari- 
ables as the complete enumerator. 


2.3 The (Hamming) Weight Enumerator 


For the rest of the paper let q be any prime power. 
The (Hamming) weight of a vector v, wi(v), is the number of its 
nonzero coordinates, so that 


a-l1 


wiv) = ps si{v). (5) 


Let @ be a linear code of length n over GF(q), and let A(z) be the 
number of vectors v in @ with wi(v) = 7. Then {A(z)} is the (Hamming 
or ordinary) weight enumerator of @. Similarly {B(z)} denotes the weight 
enumerator of the dual code @*. The third MacWilliams identity 
(Ref. 4, Theorem 1) relates {A(z)} and {B(2)}: 
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Theorem 3: 


> Bie = ial Yo AO + @ — Da"'G = 2. (6) 


7=0 
2.4 An Example 


Let @ be the self-dual code of length 2 over GF(5) consisting of the 
code vectors 0 0, 1 2,2 —1, —21, —1 —2. 

The complete, Lee, and Hamming weight enumerators are, respec- 
tively, 


A(20000) = A(01100) 


II 


A(01001) = A(01010) 
A(00011) = 1, 
A” (200) = 1, A’ (011) = 4, 


and 
A(O) = 1, A(2) = 4. 


In this case, X(w;w,) = a’ where a = e°"” = cos 72° + isin 72°. 
Theorems 1, 2, 3 assert (correctly) that 


Zo 212 2081 bt 212-2 + 202-1 = Alen +2, + ee + 2-2 + 2-1) 

+ (2 + aa, + a?e2 + afe_g + a*e_1)(eo + 072, + age + 02-2 + a°Z_,) 

+ (Zo + a2, + atg, + az, + a°e_1)(2o + a2, + 0°, + a72_2 + a2_;) 

+ @o + ae + ae + ae» + a2-1)(@o + 0%: + a%2 + a2» + a°2_,) 

+ (2 + a®2, + az, + azn + a72_1)(Zo + a2, + a°Z, + 072-2 + a2_,)], 
that 

2 + deze = glo + 22, + 2%)” 

+ 4G + @ + a’) + @? + a’)e2) Go + @ +o?) + + a’)e)I, 
and that 
Lspege? e510 cp az) oe 2 2)": 


III. THE BINARY CASE 


All the codes in this section are binary, so that Theorems 1 and 2 
coincide with Theorem 3. 
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3.1 Preliminaries 


Let F = GF(2); let F” be a vector space of dimension n over F. For 
purposes of notation we define a group G which is a multiplicative copy 
of F", as follows. Let x, , --- , x, be indeterminates satisfying zx? = 1 
and 2,27; = 2,2; for 7,7 = 1, --- , n. Then Gis the multiplicative group 


v 


consisting of all products x{'a3’ +--+ 22" where v; is 0 or 1. To each vector 
cS (v1 , V2, a pe) 
in F” we associate the element 
Vv V1 ve Un 


GC = UM VMq *** Ln 


of G. Thus F” and G are isomorphic, and addition of vectors in F” 
corresponds to multiplication in G. 


3.2 Characters 
Let X, , ue F”, be a character of G given by 
Lu(x") = C1) 5 


where a = uv’ is the scalar product of u, v in GF(2). 
Let o; be the set of vectors of F” of weight z. Clearly, 


ni-() 


Xx; = oe 


Veoi 


Let 


(For example, X, = 4%, +a +--+: +42, .) X;is an element of the group 
algebra QG of G over the field of rational numbers Q. 
2, is extended linearly to elements of QG, for example, 


Lu(X,) = D) Lule’). 


Veor 


Note that ,,(X;) is a rational integer, not an element of GF(2). 

Let S, be the group of all permutations of n symbols, i.e., the group 
of alln X n permutation matrices. vz is the vector obtained from v by 
multiplying by the permutation matrix 7. 

Lemma 3.1: 
X7(x") = Ly(u""’) for any win S, . 
Proof: 
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Lie) = (-—1)’, 
a=unv’ =u(vr)’. Q.E.D. 


3.3 Krawtchouk Polynomials 
The Krawtchouk polynomial P,(7) (a polynomial in s) is defined by 


+2" — 9" = D Pee, (7) 


so that 


Poe De ry") = ) sn ee (8) 


7~—?T 


It follows from the definition that 


pS P(t) = 2" bo. (9) 
+=0 
Other properties may be found in Refs. 10 and 11. 
Let J, be the vector with v, = vp = --: =v, = landv,4, = °°: = 
Vr, = 0. 


Lemma 3.2: Ifu has weight s, 
Xu(X;) = P,(2). 


Proof: Since X, is clearly invariant under any permutation in S, we 
may suppose, by (8.1), thatu = J,. 
Consider the formal sum 


» (Xz = se,,( > X<'). 
7=0 1=0 


Now 
» Xz = [[ 1 + a2), (10) 
i=0 j=1 
and 
wd bae) = ff f= 1-48, 
l+z if j=sti,---,n. 
Thus 


Yo, (Xje = +2" — 2). QED. 
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Lemma 3.8: 


(;)P00 = (Srv. 


Proof: By rearranging the binomial coefficients in eq. (8). 


3.4 Definition of B(t) and Proof of Theorem 3 


Let @ be an arbitrary (linear or nonlinear) code, i.e., any subset of F’; 
let A(z) be the number of vectors in @ of weight 7. Define 


GH Ue: 


ved 


@ is an element of QG. Corresponding to @ we define numbers B(?), 
7=0,1,---,n, by 


> x4). (11) 


Bi) = @] & 


Note that B(2) is a rational number, perhaps negative. 
With this definition of B(z) we can now prove the binary version of 
Theorem 3, as follows. Define 


@" = » v", 


ve 


We average @ over all equivalent codes @”: 


Lemma 3.4: 
+ @ = DAM! m—d)!X,. 
reSn i=0 


Proof: Let v be a vector of weight 7 in @. The 7! permutations of the 
nonzero symbols of v leave v unchanged, as do the (n — 7)! permutations 
of the places in which v contains zero. Thus 


Se =i(n—DIX;. Q.E.D. 


reSn 


Lemma 8.8: 


Bij) = : : De MeO): 


Proof: As 7 runs through S, , J;7 runs through j!(n — 7)! copies of o; . 
Q.E.D. 
Proof of Theorem 3: By (8.5), (3.1): 
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B@ = Lat x.( > @’) 


la] j!(™m— 9)! iBa 
= ‘al sat A(a)t! (n — 2)! a by (3.4), 


= Ma AO =X PA) by G2) 


: 


ae > AQP.) by (8.8). 


Multiply both sides by 2’ and sum on j: 


2 Bie! = gy LAO Y Pade 


: 


ae 3 A(Q)(1 +2) — 2). Q.E.D. 


In the next section we show that in the case @ is linear, B(2) is the 
usual weight distribution of the dual code. 
3.5 The Dual Code 


If @ = Yovern Qyt”, dy € Q, is any element of QG for which A(0) = 
we define its formal weight distribution to be {A(z)}, where 


A(i) = Dia, (12) 
l@| = > A(i), (18) 
and its formal dual to be 
@* > Xu(@)2”. (14) 
=a uer? 


It follows from (12) that the formal weight distribution of @* is {B(2)}, 
where 


; 1 
BY) = gj XH). (11 
If @ is a linear or nonlinear code, then clearly (12), (13) give the 


usual weight distribution and total number of codewords, and eq. (11’) 
for B(2) coincides with eq. (11) of Section 3.4. 
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Theorem 4: If @ is a linear code, then the expressions (14), (11’) for its 
dual code and weight distribution of dual code, coincide with the usual 
definitions. 

Proof: If u is in the dual subspace to @, then X, (2) = 1 for all ve @, 
so X.(@) = |@|. If ug @*, then uv” = 1 (modulo 2) for exactly half 
the vectors v ¢ @, so 


%.(@) =0 for u¢g@. 
Therefore from (14), 
Gea yas QED. 
|| ueQ@l 


Combining Theorem 4 with the results of the last section, we have 
completed the proof of Theorem 8 for binary linear codes. 


Theorem 5: LetQ@ = ae pn Qyt", ay © Q, be any element of QG for which 
A(O) = 1, with formal dual @* given by eq. (14). Then 

(2) |@| |e*| = 2°, 

(i) (@*)* =@. 
(Note that by the earlier remarks this theorem includes linear and 
nonlinear binary codes as a special case.) 
Proof: (¢) Set 2 = 1 in Theorem 38. 
(it) From (14), (@*)* = diuer» But", where 


1 1 
u Xu @ ) 
B ia" (@°) 





1 ¥ : 
= Fe Dy @)ula") by (, (14), 
1 w v 
=o DY Xe Dy aw) Lu (2"), 
veRrn werk" 
1 v(utw 
3 Pee Bee 
res 
mee 9” (2 Qu), 
since the innermost sum is zero unless u = w. Q.E.D. 


Remarks: In spite of Theorem 5, eq. (14) is not always a satisfactory 
definition of the dual of a nonlinear code, even in the binary case. 
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For example, Fig. 1 shows a nonlinear code with weight distribution 
A(O) = A(8) = 1, AQ) = A(6) = 7, and 


1 /1 1 
@=14+ 216, +as+ ++ + 2s) + 2; a(t 4 2(L 4 es +4). 

41 \Xe Le 
When the weight distribution is substituted in the right-hand side of 
the MacWilliams identity (6), B(z) is found to be the same as A(z) 
(Ref. 4, bottom of p. 82) so that this code is in some sense self-dual. 
However, although eq. (11) correctly gives the weight distribution 
BO) = B(8) = 1, B(2) = Bb) = 7, eq. (14) gives 

@=l—-ta@tate tats DE tate 
2s81<758 

which seems unsatisfactory. A better definition of the dual of a nonlinear 
code has recently been given by P. Delsarte and J. -M. Goethals (private 
communication). 


Iv. THE GENERAL CASE 


4.1 Preliminaries 
Let gq = p’, f = 1, where p is prime; and let F = GF(q) = {w, = 0, 
Wy, °°* 4 Wei}. Let x’ be commuting indeterminates satisfying 


(mj) (wk) _ (wy tor), 
v; i = 4; " ’ 


and let G be the multiplicative group consisting of all products 
xP ay? +++ 2”, v; ¢ F. To each vector v = (v;, +** , Un) in F” we 
associate the element 7 = 2°” --- 2°” of G; as in Section 3.1, G is 
a multiplicative copy of F”. Let CG be the group algebra of G over the 
complex numbers. 


4.2 Characters 


Let p(x) be a primitive irreducible polynomial of degree f over GF (p), 
and let a be a root of p(x). Then any element \ ¢ GF'(q) has the canonical 
representation 


X= do + Aya + Now” + ee) + Aya, \; ¢ GF(p). 


If GF (q) is considered as an additive group, it forms an abelian group, 
denoted by (GF(q), +), which is isomorphic to the direct product of f 
copies of GF(p); the isomorphism being given for example by 


A (Ao,AL, aes » As-1) 
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A character 9% on GF(q) is a homomorphism from (GF(q), +) to the 
multiplicative group of the complex numbers. Define a fixed character 
on GF(q) by 


sc(h) = §* 
where € = e°*”’”, and 
sc(A + gw) = ght, 
All characters on GF(q) are now given by 
(A) = L(y), all »y e GF(q). 


All of the following depends on the choices of p(x), a, and X; this 
dependence on coordinatization seems inevitable in studying codes 
over GF(q). 

Define a character X, on G by 


xa(2") = sluv") = ofS us) (15) 


where >."_, u,v, e GF(q). These characters form a group isomorphic to G 
(and to F”): LX, <> x". We extend X, to CG by linearity. 


Lemma 4.1: 
Cae) = L(x") foranyreS, . 


The proof is straightforward and is omitted. 


4.3 Generalized Krawtchouk Polynomials. 

Let s = (8 , 81, °°* 5 Sg-1)) t = (40, f, «+: , &-1) be compositions as 
defined in Section 2.1. The generalized Krawtchouk polynomial P,(t) is 
defined by 


I o2 st(own)e)) = pe P,(t)20°2' ects Ne (16) 


i=0 


Let X= >) 2. 
ver" 
comp (v) =t 


Lemma 4.2: 


n aq-1 
TI] Doar ee = DX ere ee ei 
7=0 t 


k=1 


This is a straightforward generalization of eq. (10). For example, 
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expand the product (n = 3, q = 4) 
(af? P2g + xl?P2, + af°Peo + xi’ 2s) 
(03° '2q + ag? P2, + 29°? 2. + 2h" 25) 
(as? P2o + as?? 2, + a3? P22 + 25°25). 
Lemma 4.3: For any composition s let 
—s, ==> =—s, = —3,..-—-—_— 
td Si Goin 22 esate Sey Se Sse, Mery) 
so that comp (u) = s. Then 
,(X.) = P,(t). 


Proof: Consider the formal sum 


Nek ee = 
t 


2 


q-l 
> viz) by (4.2), 


k= t=0 


_ o 


a 
i 


a, 
st . 
= IT Dix@? ye. 
k 


t=0 


ce 


-1 


= L(uzw;)2; by eq. (15), 


=O 


3 


Co 
W 

~ 

o. 


a 
L 


q--1 $l 
fi( (oxo). ) by the form of u, 


t=0 
Pb, e825 5° 
by eq. (16). Q.E.D. 


For a composition s, let (%) denote the multinomial coefficient 
n!/(sols,! ca Ba 


Lemma 4.4: 


I 


™ sel 


(2)P0 = (Jee. 


Proof: Set a, = >+%2) X(w,w,)2; , 80 (16) becomes 
I] ai' = ye P(t) Il a 
Multiply by [][%=3 (ys! and sum on s: 


=(%) Tew" = O ("0 Me Ta. an 
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The left-hand side is 
(aoYo + ary + +++ 1 Og-1Ye-1)” 
which rearranged becomes 


(BoZo + Bir21 Foe EF Ba-1Yo-1)"; (18) 


where 


_ 


q- 


B; = X(w 01) Yr . 


=0 


~ 


Expanding (18) we get 


2d e) is °3 se(owi)y) 2 = » (") ey P,(s) I Yi II 2. 


4=0 = s,t t (19) 
Equating coefficients in (17), (19) gives the result. Q.E.D. 


4.4 Definition of B(s) and Proof of Theorem 1 
As in Section 2.1, let @ be any code in F", with complete weight 
enumerator {A(t)}; and let 


@= dx 


ve 


be the corresponding element of CG. For each composition s define 


Bs) = ial + (20) 


comp (u)=s 


In general B(s) is a complex number. With this definition of B(s) we 
can now prove Theorem 1. 


Remark: If @ is a linear code it follows immediately (as in the proof of 
Theorem 4) that {B(s)} is the composition of the dual code to @. 

We first average @ over all equivalent codes. For a vector u of com- 
position ¢, 


q—-l 


Ser =Te) SY 2 


TeSn i=0 ver 
comp (v)=t 


Set d(t) = [[%! (¢;!). Then 
>» @ = > d(t)A(t)X; . (21) 


TeSn 
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Proof of Theorem 1: 
From eq. (20), 


> X.(@) 


uer" 
comp (u)=s 


|@| B(s) 


= is > Lux (@) 


TeSn 
a by (4.1), 
a ( ou @") by (4.1) 
[u is now the vector defined in Lemma (4.3)], 


1 
=O) ye d(t)A(t)XW(X,) — by (21), 


d(t 
= OHS wre by (4.8), 
= ye P,(s) A(t) by (4.4). 
Multiply both sides by 25° --- 232; and sum over all compositions s. 
Q.E.D. 
4.5 Proofs of Theorems 2 and 8. 
We use the notation of Sections 2.2 and 2.3. 
Proof of Theorem 2: 
In eq. (2) replace z; by 2; for 1 < 7 S 6. Then using eq. (3), we see 
that eq. (2) collapses into eq. (4). Q.E.D. 


Proof of Theorem 3: 
In eq. (2) set 2. = 1, 2; = z fori ¥ 0, and use eq. (5) to obtain (6). 
Q.E.D. 


V. DISCUSSION 


We return to the binary case, which is easier to visualize. 

The Hamming distance between vectors u, v is the weight of u + v 
(the weight of u — v if not binary). Coding theorists are interested in 
the distance structure of a code, not just in its weight structure. For 
linear codes, these are the same; they may also be the same for nonlinear 
codes, as in the example in Fig. 1. The following lemma is obvious. 
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Lemma 5.1: The distance and weight structure of a code @ are the same 
tf and only if the weight structure of @ + v is the same as that of @ for 
allve @. 


A code of this type will be said to have property 5.1. From now on 
we restrict ourselves to such codes. 

A code with property 5.1 clearly contains the vector 0. The element 
of QG corresponding to @ + v is @2”. 

Property 5.1 implies that 


la@| Bis) = do x.(e@) = >> x.(@x") for vea. 


Lemma 6.2: Property 5.1 implies that B(s) 2 0. 


Proof: Take the sum over all ve @ of the equation 
|a| Bis) = 2) Xu(@2"). 


la?’ Bs) = Dix. 2D (@e") 


UtEgs 


> £.4(@) > Lu(2") 


Utd, 


Do (x. (@))”. Q.E.D. 


utos 


Corollary 5.3: If B(s) = 0 then X,(@) = 0 for eachucao,. 


Property 5.1 does not imply that B(s) is an integer. Since by Theorem 
5, >), B(s) = 2”/|@| , B(s) cannot all be integers unless |@| = 2". For 


example, the code (0) has property 5.1, but the B(s) are not all integers. 


ll 


I 


At present we have a satisfactory interpretation for A(s), B(s) if 
> res, @” can be generated by a linear code. (@ need not be linear; any 
collection of vectors with the same weights as the vectors of a linear 
code will give the same average.) It would be very desirable to find an 
explanation for the cases in which A(s), B(s) can be thought of as the 
weight distribution of nonlinear codes. 
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Added to galley proof: 


Since this paper was written, it has come to our attention that 
Neal Zierler (unpublished) discovered the nonlinear MacWilliams 
identity for Hamming weight enumerators in 1966. 
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Stochastic Stability of Delta Modulation 


By ALLEN GERSHO 
(Manuscript received December 3, 1971) 


The discrete-time model of delta modulation is considered for a stationary 
random input process with a rational spectral density, and an auto- 
covariance that goes to zero as the lag approaches infinity. For leaky 
integration, the joint distribution of input and decoded approximation 
processes 1s shown to approach a unique stationary distribution from any 
initial condition. Under the stationary distribution, the decoded process 
may take on all values in a bounded interval that 1s independent of the 
input process. For the often-studied ideal integration model of delta modula- 
tion, tt is shown that the successive distributions at even parity time instants 
converge to a limiting stationary distribution, while at odd parity time 
instants the distributions converge to a different limiting distribution. 
Under these limiting distributions, the decoded process is assigned a 
positive probability for each level of a (discrete) lattice of amplitudes. The 
mean-absolute approximation error and mean-absolute amplitude of the 
decoded process are shown to be finite under the limiting distributions. For 
both ideal and leaky integration cases, an explicit upper bound on mean- 
absolute approximation error is given, which ts independent of the spectral 
density of the input process. 


I, INTRODUCTION 


In spite of the great simplicity of delta modulation as an analog-to- 
digital encoding technique, it has not yet succumbed to an adequate 
mathematical analysis. Although realistic inputs such as speech are 
extremely difficult to characterize, considerable insight could be ob- 
tained from a thorough analysis for the case of a stationary random 
input process with a prescribed spectral density. Yet no such results 
have been obtained because of the mathematical complexity of the 
nonlinear feedback loop. In fact, the presence of a feedback loop raises 
the possibility that instability in some sense could arise. The possibility 
that the decoded signal could “run away”? or become unbounded, 
failing to track the original signal, has never been theoretically excluded. 
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Although experience with delta modulation shows that such an extreme 
form of instability never arises, it has never been shown analytically 
that the mean-square or mean-absolute quantizing noise has a finite 
upper bound. Another possibility which has never been theoretically 
excluded is erratic operation, where the statistical average of the 
quantizing noise magnitude continues to vary with time. In other 
words, although the input process is stationary, the decoded approxima- 
tion process would be nonstationary with a time-varying probability 
distribution even after low-pass filtering. If this were the case, the 
decoded process would not be replicating the original process very 
effectively. 

Recently, D. Slepian’ has developed an exact computational approach 
for finding the joint probability distribution of the original and encoded 
processes. These results make it possible to accurately compute such 
curves as the mean-square quantizing noise versus step size for particular 
spectral densities of the input process. Slepian’s results are based on 
the initial assumption that, for a stationary input process with rational 
spectral density, the joint probability distribution will approach a 
unique stationary distribution from any starting condition. (For delta 
modulation with ideal integration, the stationary distribution actually 
refers to half the sum of the distributions at two successive time instants 
to account for the well-known parity change between even and odd 
amplitude levels.) On a practical level, this stationarity assumption 
seems to be entirely reasonable; yet it has never been theoretically 
justified. 

Other authors have also assumed stationarity. In particular, for ideal 
integration H. van de Weg” assumed implicitly that the decoded process 
had two different stationary distributions for the even and odd parity 
time instants. D. J. Goodman’ assumed a random phase initial condition | 
so that only one stationary distribution, half the sum of the even and 
odd parity distributions, need be considered. In both cases the assump- 
tion is implicitly made that for any initial condition the delta modulation 
process will approach a steady-state mode of operation with a separate 
stationary distribution for even and odd time instants. 

As a final argument to point out the need for an analysis of stochastic 
stability properties of delta modulation, consider the fact that most 
heuristic and semianalytical approximate considerations of delta 
modulation are based on the model of an ideal integrator in the feedback 
path, while most physical realizations involve a leaky integrator. There 
is a basic qualitative difference between these two cases, even for ex- 
tremely wide-band integrators. This is because ideal integration gives 
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equal weight to a current input sample and an arbitrarily remote past 
input sample, while leaky integration forgets remote past samples. 
Hence, it is not clear that the ideal integration model is meaningful, 
even if it is known that the leaky integration system is well behaved. 
To justify the validity of using an ideal integration model, a rigorous 
demonstration of the stability of this model is needed. 

This paper demonstrates the stochastic stability of delta modulation 
for both ideal and leaky integration by giving a mathematical proof 
that the joint distribution of the input and decoded processes approaches 
a unique stationary distribution from an arbitrary starting point and 
by deriving an explicit, finite upper bound on the mean-absolute quan- 
tizing noise. The input process is assumed to be stationary and con- 
tinuously distributed, with finite variance, a rational spectral density, 
and an autocovariance that approaches zero as the lag goes to infinity. 
A shaping filter with white noise input is assumed as the generating 
mechanism of the input process. 


II. PROBLEM FORMULATION AND SUMMARY OF RESULTS 


For most purposes the following discrete-time model of delta modula- 
tion is an acceptable description of the actual continuous-time opera- 
tion.* Let u, denote the sampled analog values of the input process at 
successive time instants ¢ = 0, 1, 2, --- . The time scale is normalized 
for convenience without loss of generality. The delta modulator shown 
in Fig. 1 generates a binary-valued process b, according to 


b, = sgn (ux — 2) (1) 


where sen y = +1 if y 2 O and —1 if y < 0, and 2, is the decoded 
process which approximates u,. The decoded process 2; is given re- 
cursively by 


Iyer = X + Asen (u, — Tr) (2) 





INTEGRATOR 


Fig. 1—Delta modulator. 
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for the ideal integration case, and by 
Lev1 = at, + Assen (uy, — z,) (3) 


for the leaky integration case with a simple RC integrator, where A > 0 
is the step size and 4 S a < 1. (In practice, a is very close to one because 
the integrator time constant is much longer than the sampling period.) 
The quantizing noise, 


Ce = Ur — Uy 


is the error at time k due to the analog-to-digital-to-analog processing 
of the delta modulation system. 

Assume the input process u, is stationary, continuously distributed 
with finite variance, and has autocovariance approaching zero as the 
time lag goes to infinity. For convenience, assume also that the prob- 
ability density of u, is everywhere positive so that, as in the Gaussian 
case, there is a positive probability of u, lying in any open interval. 
Note that u, is not assumed to have zero mean value. 

A summary of the results obtained in this paper follows. Let /, 
denote the joint distribution of the input and decoded processes u, and 
x, at time k. 


2.1 Ideal Integration 


(t) For any initial condition of the form z, = mA + @ with m an 
integer and |@| < A/2, the two sequences of distributions {f.,} and 
{Foxsi} Separately converge to unique stationary distributions G, 
and G,, respectively. One distribution assigns positive probability for 
the process x, — 6 to even integer multiples of A, the other distribution 
to odd integer multiples, depending on the even-odd parity of the 
initial integer m. 

(77) With these stationary distributions the mean-absolute quantizing 
noise, averaged over two successive time instants, is bounded according 
to 


3B | a, — ue | + EB | tear — wei ||] S Bl us| + A/24+ 2] 6]. (4) 
2.2 Leaky Integration 


(i) For any initial value of z,, the distributions {/,} converge to 
the unique stationary distribution G, under which x, may take on all 
values in the range 


A 
ees (5) 


Qa 
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(iz) Under this stationary distribution, the mean-absolute quantizing 
noise is bounded according to: 


E\|x,—w|S E|u,| + A/2e. (6) 


Note the qualitative difference between the leaky and ideal integration 
cases. In the leaky case, x, is distributed over a finite interval; in the 
ideal case, x, is discretely distributed on a lattice. 


II. MARKOVIAN MODELING 


Since the input process u, is stationary with finite variance and 
rational spectrum, then &, = u, — wu (where yu is the mean value of u,) 
can be modeled’ as the response of a stable discrete-time shaping filter 
to a zero mean, finite variance ‘“‘white noise’’ process w, (with w, inde- 
pendent of w,-; for z = 1, 2, 3, --- ). More precisely, a white process 
w, and a stable rational shaping filter H(z) can always be specified in 
such a way that the response @, of the filter to the excitation w, will 
be a stationary random process with spectral density identical to that 
of &, . If w, is also chosen to be continuously distributed with a positive 
density, then @, will also satisfy this property. Thus, all the assumptions 
made in Section II about the process u, are possessed by the process 
@, + u. It is therefore reasonable to study the effect of the delta modula- 
tion system for the input @, + u, whose structure or generating mecha- 
nism 1s known. For the remainder of this paper, no distinction will be 
made between &%, and @, . 

Using this model and the assumption that the autocovariance of u, 
goes to zero as the lag approaches infinity, Appendix A shows that 
u, can be imbedded in a vector Markov process, d, , with 


d, = (dia ) diz por ey din)’ 
and 
Un = diy + fad (7) 


where » denotes the de value of the input process and n is the number 
of poles in the shaping filter. The vector d, characterizes the state of 
the filter at time k and is generated by the recursion 


d.., = Ad, + bw, (8) 


where A is an n X n matrix with eigenvalues within the unit circle, 
and b is a fixed vector. The process d, is Markovian, since the condi- 
tional distribution of d,,, given all past states d,, d,-1, --- , depends 
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only on the given value of d, . Appendix B shows that for any initial 
state dy, the distribution of d, approaches a unique stationary distri- 
bution. Equations (2), (7), and (8) for the ideal integrator case, or 
eqs. (3), (7), and (8) for the leaky integrator case, jointly characterize 
the evolution in time of a Markovian process whose state s, at time k 
is given by the n + 1 component vector 


S, = (ay , Uy, Axo , Aes , at » dn)’. 


Then, given the value of s, , the distribution of s,,, is completely 
determined. Henceforth, a distribution F, , describing the joint distribu- 
tion of the n + 1 components of the vector s, , will be regarded as a set 
function which assigns a probability F(A) to the region A of the n + 1 
dimensional space of possible values of the state vector s, . The prob- 
ability transition function’ characterizing the Markov process is defined 
as 


p(s, A) = P{sisi1eA | Ss, = s} 


which is independent of k. By averaging this conditional probability 
over a distribution F’, assigned to s, , the unconditioned distribution 
F441 Of Sz41 18 obtained: 


P(A) = | ls, Aids) = Bryp(s, A). @) 


Thus, F,,; 1s related to F, by a linear mapping 7’, so that in operator 
notation 
Pryaa = TF, . (10) 


Note that 7 plays the same role as the probability transition matrix 
in Markov chains. The process has a stationary distribution G, if G = TG, 
so that G is self-reproducing. If any state vector has distribution G, all 
subsequent state vectors will have this distribution. The existence of 
a stationary distribution is a necessary but not sufficient condition 
for the convergence of the distributions F,,, to a limiting distribution. 

The Markovian model will be used in Sections V and VI to obtain 
the convergence properties of the distributions {F,}. But first, it is 
necessary to obtain a bound on the time and ensemble average of the 
quantizing noise. 


IV. BOUNDING THE TIME-AVERAGED MEAN-ABSOLUTE QUANTIZING NOISE 


Suppose that at the initial time instant k = 1, the system has an 
arbitrary initial state s,.. Appendix B shows that for any initial state 
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d, of the shaping filter, the process d,, and hence u,, converge in 
distribution and £ | u, | converges to c, the mean-absolute value under 
the stationary distribution of the process wu, . 

Squaring both sides of eq. (8) gives 


ax, — 2aA | uy, — x | + 2adu, sgn (uy, — 2,) + A’ 
S x, — 2A |u, — x | + 2ad|u, | + A’. 


2 
VUn+1 


Taking the expected value of both sides yields 
Exi.; < Ex — 2adE | e, | + 2aAF | uz | + A? 


and iterating backwards gives 


2 
Eris. S 


ai + 2ad + Blu |— Bh |) + A’k. 
Since the left side is nonnegative, it follows that 
bY Bl e:| Saf/2aak +2 OB |u| + A/2a. (11) 
Hence, using the fact that E | u, |— c, 
lim sup YE le; | Sc+ A/2a, (12) 


which shows that the long-term time average of the mean-absolute 
quantizing error is bounded. Since the preceding derivation holds for 
a SX 1, it applies for both leaky and ideal integration, setting a = 1 
for ideal integration. 

For leaky integration, the decoded process is, in fact, bounded 
deterministically. Integrating eq. (3) backwards yields 


k 
e+ = aa, +A > a "bd; (18) 
t=1 
so that 
A 
| t+ | Sala |+ 7-7 - @). (14) 
—a 
Hence, 





lim sup | a, | S (15) 
ko 


tea 


Thus, the decoded process is bounded with probability one in the case 
of leaky integration. 


828 THE BELL SYSTEM TECHNICAL JOURNAL, APRIL 1972 


V. EXISTENCE OF A STATIONARY DISTRIBUTION 


A sequence of random vectors and the corresponding sequence of 
distributions are said to be stochastically bounded if, for any probability e, 
however small, there is a sufficiently large distance R such that each 
random vector of the sequence has probability less than e¢ of having 
length greater than R. Hence, the successive vectors cannot have a 
positive probability of moving out toward infinity. 

For a sequence of distributions {F',}, define the associated sequence 
of averaged distribution {G,} by 


G,(A) = 2 > FAA). (16) 


Thus, if J is a randomly selected time instant from the first k integers 
each having equal probability, then G,(A) is the probability that the 
random vector s; lies in a region A, where s; has distribution F;, . 
If the sequence {F,;} is stochastically bounded, then the averaged 
distributions {G,} are also stochastically bounded; however, the converse 
is not always true. 

To show that the Markov process s, defined in Section III has a 
stationary distribution G, the following theorem, proved in Appendix C, 
may be used. 


Theorem: A Markov process has a stationary distribution if 


(¢) for any initial state s,, the averaged distributions G, are sto- 
chastically bounded, and 

(iz) for any region A, let D be the set of points s at which the transi- 
tion probability function p(s, A) is discontinuous and let NV; be 
the set of all points whose distance from D is less than 6; then 
there is a function C(6) independent of s with C(6) — 0 as 6 > 0 
and for all s, 


p(s, Ns) S C(6). (17) 


Condition (2) excludes the possibility that successive state vectors 
can move out toward infinity. Condition (7) is concerned with the 
region of state space where an arbitrarily small perturbation of a given 
state vector can cause a substantial change in the induced distribution 
of the state vector at the next time instant. This region is contained 
within the set N; for each 6 > 0. The condition requires that this region 
be suitably unimportant. 

To show that the delta modulation process satisfies condition (2), 
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observe that 
E\a;|s FElu;-—a|+#|u; | 


and since E | u; | is bounded, eq. (11) shows that for some constant K, 
1 k 
i=1 


Now Chebyshev’s inequality, 
MP {|2;|>M} S Ela; |, 
applied to eq. (18) yields 


pve 


for every M > 0, which shows that the averaged distributions of the 
decoded process 2x, are stochastically bounded. But from Appendix B, 
the d, process is stochastically bounded. Hence, the averaged distribu- 
tions of d, are also stochastically bounded. Thus, the marginal distribu- 
tions of the joint distributions G, are stochastically bounded so that 
G, is itself a stochastically bounded sequence, and condition (2) holds 
for the ideal integration case. For the leaky integration case, condition (7) 
is satisfied since eq. (15) shows that the x components of the vectors s, 
are uniformly bounded with probability one, so that the above argument 
shows that the joint distributions G, are stochastically bounded. 

To verify condition (72), note from eqs. (2) or (3) that for any region 
A, p(s, A) is continuous, except in the set D of all points s the first two 
components of which, x and u, are equal. Appendix D shows that given 
Ss, , the variate u;,,, is continuously distributed and that this implies that 
there’ exists a function C(6) which goes to zero as 6 approaches zero and 


Pil x — ul | se} S CO) (19) 


where C'(6) is independent of z and s,. But since 2,,;, is completely 
determined by s; , (19) implies that 


P{| Le+1 — Une+i | <6 | S;} = C(6). (20) 
Hence, eq. (17) is satisfied and condition (22) holds for both ideal and 
leaky integration. Therefore, a stationary distribution exists. 
VI. ALLOWABLE AMPLITUDE VALUES FOR THE DECODED PROCESS 


For ideal integration, it is clear from eq. (2) that an initial condition 
of the form 
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1 =mA+ é 


with m an integer and | @| < A, implies that all subsequent amplitude 
values of the decoded process will be confined to the lattice 


goSIAAG. TS Op ae: 


Since the preceding results did not specify a particular choice of initial 
condition, it follows that for each 6, a stationary distribution G, exists. 
For convenience, assume @ = 0. No loss of generality will result because, 
for any 6, the problem can be converted to the 6 = 0 case by replacing 
the input process u, by u, — @ as can be seen from eq. (2). This simply 
changes the de value of the input by, at most, one step size A. 

Under the assumption that u, has a positive density, it follows 
from eq. (2) that there is always a positive probability of either increas- 
ing or decreasing by A in going from z, to 2,,,. This means that every 
integer multiple of A must have a positive probability under the station- 
ary distribution, because each level can always be reached from any 
other level in a finite number of steps. (On the other hand, if u, were 
bounded, then eq. (2) shows that the decoded process would get locked 
into a bounded set of levels from which it would never escape.) 

For leaky integration, the situation is strikingly different. The 
decoded process will get locked into the bounded region 


X = {z: —A/(l— a) Sx s A/(1—a)}. (21) 


This may be seen by noting from eq. (3) that if 2, is in X, then z,,, 
must also be in X. Consequently, once zx, is in X, it will never escape. 
If x, is not in X, then eq. (3) shows that 


| ters | < | a | 
and if b, has the correct polarity, 
| Uk+1 | <a | Lk | ‘. (22) 





Hence the values | x,,; | must decrease monotonically as long as 2,,,; 
remains outside of X. Since u, has a positive density, b; must have 
positive probability of having either polarity; which means the stronger 
inequality eq. (22) must hold at some subsequent time instants. Hence, 
the process must eventually enter the region X. 

Iterating eq. (3) backwards yields 


k-1 


Lne1 = aa, ok >, aby; . (23) 
7=0 


This shows that the initial value z, is gradually forgotten and the set 
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of allowable values for x, as k goes to infinity, approaches the set 
W = {a:2 = A(+t1 ta ta’ + ::: +} (24) 


where all possible sequences of polarities are used to generate values 
1 


of x. Appendix E proves that, for a 2 3, the set W coincides with the 
interval X.* 

For the leaky integration case, the stationary distribution G clearly 
must confine the x component of the state vector to the region X. 
Furthermore, the following argument shows that G assigns a positive 
probability to every open subinterval (y — ¢, y + e) with y in X and 
e > 0. Let {c;} be a suitable binary sequence (generated as in Appendix E) 
satisfying 


y= A dice’. (25) 
i=0 

Pick the integer N large enough so that 

aN*'y, < ¢€/8, and a” A/(1 — a) < &€/3, (26) 
and consider the event 

E= {be = Cy-1 , 63 = Cy-2,°°' , bve1 = Co}, 

which has a positive probability for any initial state s,. Then for 
k = N + 1, eq. (28) can be written in the form 

Ure2 ~ Y= aN tty, + Aa b, —A > ac; 

i=N 


so that, using eq. (26), 
| tvez — y| S ¢/8 + €/3 + €/8 = «, 
for x, in X. Hence, 
P{| tye. —y| <€[8i} 2 P{E| s:} (27) 
and averaging eq. (27) over the distribution G for the state s, shows that 
P{| tv12 —y| < ef > 0 (28) 


where zy+2 has the marginal distribution of the first component of G. 


VII. CONVERGENCE TO THE LIMITING DISTRIBUTIONS 
The following specialized and paraphrased version of a theorem due 
to J. L. Doob’ is suited to the delta modulation process: 


* For a < 3, it can be shown that W does not coincide with X, in fact W is not 
even dense in X. 
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Doob’s Theorem: Suppose the Markov process has a stationary distri- 
bution G and satisfies the conditions: 


(1) For any initial state s,, if A ts a region with G(A) = 0, then 
F(A) > 0. 
(iz) If a region A satisfies p(s, A) = 1 for s in A, then G(A) = 1. 
Then either 


F,—G _~ for any initial state 


in which case the process 1s aperiodic, or there exist m disjoint sets 
Ay, Ai, °°: ,Am-1 with GWU A;) = Land p(s, Ais1) = Lifsis in A; 
fori? = 0,1, --- ,m — 2 and af sis in A,-1, then p(s, Ao) = 1. In 
this case, the process is said to be periodic with period m. In particular 
for m = 2, there exist two distributions Gy and G, with G = 4(Go + G,), 
G,(A,) = 1, G(Ao) = 1, and for any initial state in Ay, Fors, 2 G, 
and Fy, — Gy , while for any initial state in Ay , Fora: > Go and Fo, > G,. 

For ideal integration, Section VI shows that there are no transient 
levels for the x, process and Appendix A shows that d, has a positive 
probability of lying in any region of n space with nonzero volume. 
Hence, there is no transient set A with G(A) = 0 except for trivial 
sets with F(A) = 0 for each k. For leaky integration the only nontrivial 
sets A are regions where the x component lies outside of X and for 
such regions, Section VI shows that F(A) — 0. Therefore, condition (7) 
is satisfied for both types of integration. 

Furthermore, the ergodicity requirement (72) is also seen to be satisfied 
for both ideal and leaky integration from the results of Section VI. 

For ideal integration the process clearly has period 2, since, if Ao is 
the set of all state vectors with x components taking on even integer 
multiples of A, and A, is the complementary set, then Ay U A, has 
probability one under the limiting distribution and the transition 
probability function has the requisite property implying the state 
vector alternates between A, and A,. For leaky integration the process 
is aperiodic since eq. (27) holds for all N sufficiently large so the process 
cannot satisfy the requirements for periodicity, hence F, — G. Since 
a sequence of distributions cannot at the same time converge to two 
different distributions, it follows that the stationary distribution G is 
unique for both ideal and leaky integration. 


VIII. BOUNDING STEADY-STATE QUANTIZING NOISE 


The fact that a sequence of distributions converges to a limiting 
distribution does not imply that moment functions such as mean or 
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variance converge to the corresponding moment of the limiting distri- 
bution. However, it does imply that the sequence of expectations of a 
bounded continuous function converge to the expectation of that func- 
tion under the limiting distribution. It turns out that this property 
is sufficient to obtain a bound on the least mean-absolute quantizing 
error under the stationary distributions. 

For ideal integration, eq. (12) can be rewritten in the form 


lim SUP Om = py E(| ex, | + | ees |) S (29) 


where K = c + A/2, which implies the existence of an even subsequence 
of time instants t; (= 2k;) with 


2E(le. | + | esi |) SK. (30) 
Now define the truncating function Jz(e) according to 
| 1 le| SR 
Jr(le) = 31 —(le| —R)/6 R<le|<R4+6. 
| 0 R+68|e| 


Then, 
E\| C1; | =F | G:r+1 |) 2 E{| C1; | Jrl€r;) + | Cr5+1 J rl€r:+1)} (31) 


and, since the right-hand side is the expectation of a bounded continuous 
function, 


E{| €4; | J r(é:;) ie lesa | J r(C:;41)} — Ky | €; | J 2(e;) 
+ #, | C741 | J n(€i41) (32) 


where j denotes an even time instant in steady-state operation and 
E, and E, denote the expectation under the distributions Gp) and G, 
respectively, or reversed, depending on the parity of 2, . Since eq. (32) 
holds for each positive R and 6, taking the limit as 6 > 0 and K — 
shows that 


3(By |e: | + Bi | ess: |) S ¢ + A/2. (33) 


Thus, the mean-absolute quantizing noise averaged over two consecutive 
time instants has the bound c + A/2 under the limiting distributions 
for ideal integration. 

For leaky integration, the process x, is bounded with probability one, 
so that in this case all moments of x, converge to the corresponding 
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moment under the limiting distribution. Furthermore, in Appendix B, 
it was shown that the first absolute moment of u, converges to the 
corresponding value under the limiting distribution. Together, this 
implies that E | e; | converges to Eg | e; | . Hence, eq. (12) yields 


Eg |e: | S ¢ + A/2a, (34) 


This result could also have been obtained by the same argument used 
to derive eq. (33). 

Note that the bounds, eqs. (33) and (34), are independent of the 
spectral density of the input process u, and are therefore very crude 
bounds. An important feature is that the mean-absolute quantizing 
noise is shown to be finite under the stationary distributions. An im- 
mediate consequence is that the decoded process 2, has a finite first 
absolute moment for ideal as well as leaky integration. Since 


E\ez;|=Ele;-tu:|sS Ele |+#lu: |, 
then 
Eo | a; | $ 2¢+ A/2a (35) 


for both leaky and ideal integration. (Set a = 1 for ideal integration.) 
Possibly of interest also is that this bound may be used to obtain upper 
bounds on the tail probabilities of the decoded process by using the 
Chebyshev inequality. 

As discussed in Section VI, ideal integration with initial values of 2, 
of the form mA + @ can be handled by replacing u, by u, — @, so that 
the bound, eq. (35), remains valid if c is replaced by c + | 6 | , and 
|e; | by |e; | — | 6|, which leads to the inequality, eq. (4). 


Ix. CONCLUSIONS 


The results of this paper show that delta modulation indeed possesses 
the qualitative properties of convergence to a stationary distribution 
and boundedness of the quantizing noise. Perhaps of greatest interest 
is the fact that the results also hold for ideal integration, thus justifying 
the study of this idealized model to obtain an understanding of the 
usual physical situation of leaky integration. 

The use of a Markovian model of the input process has been con- 
sidered by several authors* "’ as an approach to determine actual 
probability distributions for the steady state. The results of the paper 
show that the Markovian recursion, i.e., the usual Chapman—Kolmogorov 
equation, will, in fact, converge to the unique stationary distribution. 
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The technique used here for showing the existence of a stationary 
distribution (an invariant solution to the Chapman—Kolmogorov 
equation) extends the method used by this writer’ for an adaptive 
filtering algorithm and the earlier results for Feller processes.'*'** 


APPENDIX A 


Markovian Imbedding 


Suppose initially that the process u, has zero mean. Then u, is 
generated by the recursion 


Unen = UUsin-r ote HH Ante TH BiWeen-i HF o°8 + 1 Bae (36) 


with 8 # 0. This equation describes the operation of a stable shaping 
filter with transfer function 


Pen ee ee 
A(z) n s n—t 
a> Qe 


with A(z) having all roots inside the unit circle, |z| < 1, and w, is a 
white process with zero mean, finite variance and an everywhere-positive 
probability density function. The requirement that the autocovariance 
goes to zero for lags approaching infinity is satisfied by the fact that 
B(z) is of lower degree than A(z). 

The state vector d, is defined by 


diy, = Ur — B (37) 
disiye = Gijnsi — Ow, @=1,2,°:+,n—-1 

which when combined with eq. (86) leads to the state equations 

Gnas = Ad, + bw, (88) 
characterizing a vector Markov process. The matrix A is given by 

0 1 0 vee 0 

0 0 10 ::: O 

Avs 
0 0 O 1 


Qn, An-1 cae Oy 
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and the constant vector b = (61, b2, --: , b,)’ is determined by solving 
the equations: 

by = Bi 

b; = B; + a,-1b; + +++ + a2b;-. + a,b;_, for 7 = 2,3, +--+ ,n. 


The matrix A is stable since its eigenvalues are the roots of A(z). This 
state representation is a standard one in the control literature. See 
for example, Ref. 15, p. 221. 

The polynomials B(z) and A(z) may be assumed to have no common 
roots so that the shaping filter is intrinsically of order n. Then, the 
state generating eq. (38) is known to be completely controllable.* This 
means that, for any initial condition at time k, it is always possible 
to find values for w,, Wri1, °** » Wesn-1 to produce any desired value 
of the state vector d,,, . It follows that since w, has a positive density, 
the state vectors d, have a positive probability of lying in any region 
of n-space with nonzero volume. 


APPENDIX B 


Convergence in Distribution of d;, 
The Markov process defined by eq. (88) can be iterated to obtain 


k 
ra Pee = A* d, + > A* ‘bu; . (39) 


i=1 


Since A has all eigenvalues of less than unit modulus, A* — Oask— ~, 
so that the first term on the right side of eq. (89) goes to zero with 
probability one. The second term has the same distribution as 


k-1 
vo, = » A'‘bu,; 
7~=0 


since the variates w, are independent and identically distributed. But 
v, is a martingale,”® since 
E{vys; |} = Ue 


and 
k-1 . b 
Ella lls DM WON Elo | < ee wy 


is finite, where \ denotes the Euclidean norm of A, \ < 1. Then by the 


* See Theorem 7-8, p. 389 of Ref. 15. 
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martingale convergence theorem, v, converges with probability one to a 
random variable v,,. Hence, the probability distribution of the vector 
d,.4: converges to the distribution of v,, , where 


Uo = >, A’bw; . (40) 
i=0 


Since w, is uncorrelated and has finite variance, it may be seen from 
eq. (40) that v, also has finite variance. This, together with the con- 
vergence in distribution, implies (Ref. 6, p. 252) that the mean absolute 
value of each component of d, converges to the mean absolute value 
of the corresponding component of v,, . Consequently, 


E|u,|—-e 


for any initial state d,; , where c is the mean absolute value of the first 
component of v,, . 


APPENDIX C 


Existence of a Stationary Distribution 


Theorem: A Markov process has a stationary distribution tf 


(i) for any initial state s; the averaged distributions G,, are stochastically 
bounded, and 

(i1) for any region A, let D be the set of points s at which the transition 
probability function p(s, A) ts discontinuous and let N; be the set 
of all points whose distance from D 1s less than 6; then there is a 
function C(5) independent of s with C(6) 0 as 6 > 0 and for alls, 


p(s, N3) = C(6). 


Proof: Since the sequence of averaged distributions G, is stochastically 
bounded, by the Helly selection theorem, Ref. 6, p. 267, there exists a 
subsequence G,; converging to a limiting distribution G. From the 
definition of G, and T it follows that 


TG, = G+ > Pan ae F,) 
so that 
| 7G,,(A) — G,(A) | $0 


as7— o for any region A. Since G,, converges to G, it then follows that 


TG,; 2G tL ©, (41) 
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It remains to show that 
TG,, — TG (42) 


so that eqs. (41) and (42) will imply that G = TG, which is the desired 
result. 

To prove (42), note that G,,; — G implies that for any bounded 
continuous function ¢ of the state vector (s), 


E.g(s) > Eeg(s) (43) 
where EH; is the expectation under G,,;. If p(s, A) were continuous 


in s, eq. (42) would follow from eq. (48) by setting y(s) = p(s, A) 
and noting from eq. (9) that 


E.p(s, A) = TG,,(A). 


However, p(s, A) is not itself continuous and the following argument 
is needed to complete the proof. 

Let I;(s) denote the function which is equal to 1 if s is not in N,, 
and for s in N; let J;(s) denote the distance of s from D. Then 
I;(s)p(s, A) is bounded and continuous in s for any region A, and so 


E;{I,(s)p(s, A)} > Ee{Is(s)p(s, A)}, ot &. 


But 

E;p(s, A) — E;{Is(s)p(s, A)} S GCN) (44) 
and also 

Egp(s, A) — EetIs(s)p(s, A)} S GN). (45) 
But since 


p(s, Ns) = C(8) 
by hypothesis, it follows by averaging over s that 
FANs) S C() 


and therefore 


G.(Ns) S C(6) (46) 
and, since G,,(N;) — G(N;), then 
GN) S C(6). (47) 


Combining these results shows that 


lim sup | E,p(s, A) — Egp(s, A) | < 2C(8), 
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but since 6 can be made arbitrarily small, it follows that 
E;p(s, A) = Egpis, A). 
Hence, eq. (42) holds and the theorem is proved. 


APPENDIX D 
Existence of C'(6) 
From eq. (88), it follows that 
Unsi = (Ad): + bux 


with 6, ¥ 0, so that the conditional distribution of u,., given s, is 
continuously distributed because w, is continuously distributed. Let 


1 = (Ad,)1 , and let 
H(z) = P{byw, < 2}. 
Then 
P{tsi1 <av|s.} = A(z - J (48) 
which is a uniformly continuous function of xz. Therefore, if 
C(8) = sup [H(@ + 6) — H@ — 8] 
then 


C(6) >-0 as 6-0. 
But 


P{lu,—2|< 6|s,} = H(@+l+ 6) —- H(e¢+1—- 6) S CC) 
using eq. (48), which proves the existence of a suitable function C(6) 
independent of s, and z. 

APPENDIX E 
Range of the Mappingy = +4ltataxt::: 


Theorem: For 3 S a < 1, the range of values taken on by the mapping 
i Dy ab; (49) 

i=0 
for all binary sequences, b; = +1 each 1, is the closed interval | y| S$ 1/(1 — a). 


Proof: For each y in the interval | y | S a, witha = 1/(1 — a), generate 
a binary sequence bo, b,, b2, --- according to the algorithm below. 
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Let 
p= 3 + a). 
If p = 1, let fo = 1 otherwise f, = 0. Let 


n 


s = doa't: n=0,1,2,--- 
For n = 1, 2,3, -::,ifp — s, 2 a” let fas, = 1, otherwise f,4, = 0. 
Then 

b; = 2f; — 1, 4 =O. 1 2, aes, (50) 


To prove that eq. (49) holds for the binary sequence generated in this 
manner, note first of all that for all n = 0, 


IIA 


p 


so that s, — s for some number s with s S p. Suppose that s < p. 
Then there exists a largest integer m satisfying 


Pp < Sm-1 + a”, (51) 
Therefore, f,, = 0, and fxs; = 1 for each 7 > 0. Consequently, 
p > s= so + att -- sre + ppm 


Sn = Sn+1 ) and Sn 


so that 
9 > Sm-1 + a” **/(1 — a). (52) 
But eqs. (51) and (52) imply that 





i =< <a 
so that 
a< 3) 
which is a contradiction. Therefore, 
p= > af i 


and so 
y= D> a'(2f; —aj= 2, @'b; ’ 


7=0 


which proves the theorem. 
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Perspective Drawing of Surfaces With 
Hidden Line Elimination 


By N. Y. GRAHAM 
(Manuscript received February 18, 1972) 


An efficient computer algorithm ts described for the perspective drawing 
of a wide class of surfaces. The class includes surfaces corresponding to 
single-valued, continuous functions which are defined over rectangular 
domains. The algorithm automatically computes and eliminates ‘‘hidden 
lines.”” The number of computations in the algorithm grows linearly with 
the number of sample points on the surface to be drawn. An analysis of 
the algorithm is presented, and extensions to certain multi-valued functions 
are indicated. The algorithm is implemented and tested on two different 
computers: a large central computer with hard-copy capability, and a 
small laboratory computer affording interactive use. Running times are 
found to be exceedingly efficient on both machines. Interactive implementa- 
tion of the algorithm, with on-line scope display and view-potnt control, 
enables effective and rapid examination of a surface from many per- 
spectives. 


I. INTRODUCTION 


The general problem of efficiently and meaningfully displaying 
three-dimensional objects in two dimensions is central to computer 
graphics. In particular, the “hidden line problem” of drawing objects 
in perspective while eliminating line segments not visible from the 
viewing point, is one of long standing. In recent years various algorithms 
dealing with this problem have appeared in literature.’~’ Some of these 
algorithms entail prohibitively long computation time (relative to 
the number of data points) or have large storage requirements; some 
have both of these undesirable characteristics. 

In this paper a detailed description will be given of a highly efficient 
algorithm for the perspective drawing of an arbitrary surface which 
corresponds to a single-valued continuous function defined over a 
rectangular domain, with elimination of hidden lines. The vantage 
point from which the surface is viewed may be any point not on the 
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surface. Some generalizations of the algorithm to nonrectangular 
domains and to multiple-valued functions will be indicated. 

This algorithm has been implemented on both a large and a small 
computer (Honeywell 6070, DDP 516) with considerably shorter 
computation time than that based upon previous algorithms for drawing 
similar surfaces. A very useful feature of the implementation on the 
DDP 516 (a 16K laboratory machine with 16-bit word length) is that 
the program may be run interactively, allowing the user to select 
varying vantage points and rapidly display different views of a given 
surface. 


II. STATEMENT OF THE PROBLEM 


Given a three-dimensional surface S defined over a rectangular 
domain R, and a vantage point V not on S (Fig. 1), we may assume 
that R is centered about the origin of the x, y-plane and is oriented 
so that its sides are parallel to the z-axis and y-axis, respectively. 
(This may be accomplished without loss of generality by a suitable 
translation and rotation of R, together with V). The plane P, containing 
the perspective image S’ of S, is chosen to be perpendicular to the 
line joining V and the origin. A rectangular coordinate system with 
az’, y’-axes is chosen on P which preserves the original vertical direction 


V= (v, »V2 »V3) 





Fig. 1—Perspective projection of surface S on plane P. 
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OF GRID 


Fig. 2—Surface patch of one grid subrectangle. 


of S. That is, the y’-axis on P must be the projection of the original 
z-axis. This latter choice is necessary, as we shall see later. (Relative 
to the rectangular system on P, the perspective image on P of each 
point of S has a well-defined pair of coordinates. The evaluation of 
these coordinates will be given in the Appendix.) 

The surface S will first be quantized to an M X N rectangular grid 
of points on the domain R&. Figure 2 shows the grid lines on R, 
partitioning R into subrectangles. The part of S defined over one 
subrectangle is shown. It will be referred to as a “surface patch” of 
S. After quantization, only the four points of the patch defined at the 
four vertices of the subrectangle are known, and linear interpolation 
of the function will be assumed between adjacent grid points, that is, 
between adjacent points in the same row and adjacent points in the 
same column of the grid. (Thus, S need not be explicitly defined by 
a mathematical function; the data for S may be given by a rectangular 
array of points corresponding to a rectangular grid of points on the 
domain.) The behavior of S in the interior of each subrectangle of the 
grid will be ignored. Thus, each surface patch of S will be represented 
by its four linearly interpolated edges in three-dimensional space. The 
image of these edges on the projection plane is a four-sided polygon 
(Fig. 3). For each surface patch of S, only the visible line segments 
of its edges will be drawn. 

Intuitively, the surface should be thought of as an opaque elastic 
membrane stretched over a rigid frame consisting of all the linearly 
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Fig. 3—Projection of linearly interpolated surface patch. 


interpolated edges of the surface patches. Then the problem is to give 
a perspective drawing of the frame which eliminates line segments 
of the frame hidden by the opaque membrane from the vantage point. 
(Note that the vertical projection of the frame onto the z, y-plane 
coincides with the grid lines on R.) 


III. BASIC IDEAS OF THE ALGORITHM 


The algorithm for determining the visibility of any given edge is 
based upon two ideas. The first one consists of choosing a particular 
ordering of the surface patches of S so that no part of any 
patch occurring earlier in the ordering is obscured from the vantage 
point by any part of a patch occurring later. The surface is to be drawn 
according to such an ordering, one patch at a time. Then, a line segment 
on any edge of a patch under consideration is hidden from the vantage 
point if and only if its image on the projection plane lies inside a region 
of the plane already covered by the images of earlier patches. 

This particular ordering of the surface patches has the further 
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property that, for a surface S which corresponds to a single-valued 
continuous function over a rectangular domain, the region of the 
projection plane being covered by the emerging image of S will grow 
contiguously as each patch in the ordering is drawn. And because of 
the preservation of the vertical direction of the surface, at each stage 
of the drawing of S, this “hidden” region of the plane can be precisely 
bounded between two piecewise linear continuous functions. The 
determination of the visibility of any edge of a surface patch is thus 
reduced to the problem of deciding which part of its image on the 
projection plane lies between these two functions. This delineation of 
the hidden region of the projection plane, after each patch is drawn, 
by means of two piecewise linear continuous functions, constitutes 
the second idea behind the algorithm. 


Remark: If the y’-axis on the plane of projection is not chosen to be 
the perspective image of the original z-axis, then the image of the surface 
may be ‘‘tilted’”’ so that it will be impossible to delineate its boundary 
by using only two functions. 


IV. ORDERING OF THE SURFACE PATCHES 


The fact that it is possible to order the surface patches of S so that 
earlier ones are not obscured by later ones depends on the following 
observation: 

Suppose A and B are any two points of the surface such that A 
obscures B from V. Geometrically, this means that A, B, and V lie 
in a straight line, and the distance from A to V is less than the distance 
from B to V. Then it can be easily shown that the vertical projections 
A, , By , Vo of the points A, B, V, respectively, onto the z, y-plane 
satisfy this relationship also. That is, they are collinear, and A, is 
closer than B, to Vo. 

This observation is the key to the ordering of the surface patches. 
First, consider the case in which the vertical projection Vp = (v, , v2 , 0) 
of the vantage point onto the x, y-plane is southwest of the domain, 
as shown in Fig. 4. [That is, v, S x, and v. S y, , where (% , y: , 0) 
is the lower left-hand corner of the domain.] An immediate consequence 
of the above observation is that if A = (a, , a , a3) obscures B = 
(b, , bz , bs) from V, then a, < b, and a, < by ;i.e., Ay must be southwest 
of By (given the southwest location of V,). This suggests the following 
rule for determining the relative order of points of S, given the south- 
west location of V, : Let A and B be any two points of S. If one of 
them, say A, is southwest of the other (more precisely, if Ay is south- 
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V=(V,Va»V3) 






(x,, YU, 0) 0 


Vo = (v, ,V2,0) 


Fig. 4—Points of the surface with A obscuring B. 


west of By), then A must precede B. Otherwise, A and B may be ordered 
independently of each other. 

There are various ways to order the surface patches of S so that this 
rule for the relative order of points is satisfied, and hence, so that 
earlier patches are not obscured by later patches. Figure 5 illustrates 
one possible ordering of the subrectangles of R so that the induced 





Fig. 5—One possible ordering of subrectangles. 
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ordering of the patches of S has this property; i.e., they are ordered 
by rows, from left to right in each row, beginning with the front row 
(relative to the southwest location of Vo). It 1s obvious that, under 
this ordering of the patches, no point of an earlier patch will be obscured 
by a point of a later patch, since no point of a later patch is southwest 
of any point of an earlier patch. 

For other ‘‘corner’” locations of Vo (i.e., northwest, northeast, and 
southeast of the domain) analogous orderings of the surface patches 
may be made. Instead of discussing suitable orderings of the surface 
patches for other locations of V> , we shall proceed to describe the 
algorithm in detail for the special case in which V> is southwest of the 
domain, and then show how the other cases may be reduced to cases 
involving only ‘“‘corner’’ locations of Vo . 


V. THE ALGORITHM IN DETAIL 


Let S be a surface defined over a rectangular domain R& of the z, 
y-plane. Assume S is to be viewed in perspective from a vantage point 
V whose vertical projection V» onto the zx, y-plane is southwest of R. 
Given an M X N rectangular grid of points on R&, for J = 1,2,---,M 
and J = 1, 2, --- , N, let AU, J) denote the ‘sample point” on S 
defined at the grid point on FR located at the 7th column and J/th row 
of the grid. For J = 2,--- , MandJ = 2,--- , N, let SU, J) denote 
the surface patch with vertices AU, J), AU, J — 1), AU — 1, J), 
A(T — 1, J — 1). (See Fig. 2.) By above considerations, the patches 
may be ordered as follows, and drawn according to this ordering, 
one at a time: 


S(2, 2), SB, 2), nee, SM, 2); 
S(2, 3), SG, 3), nee SM, 3), 


S(2, WV); S(3, N), nae » SM, N). 


Now, consider the sequence of points A(1, J), J = 1, --: , N, defined 
along the first column of grid points. The line segments joining these 
points in consecutive order will be called the “leading left edge” of S. 
Similarly, the line segments joining the sequence of points A(J, 1), 
I = 1, --- , M in consecutive order will be called the “leading front 
edge” of S. As a consequence of the observation made at the beginning 
of Section IV, both leading edges are entirely visible from V. Thus, 
we may begin by drawing all the line segments of the leading edges. 
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Then the drawing of each successive patch S(J, J) in the ordering may 
be completed by adding the visible segments of just two of its four 
edges, namely its ‘back edge,” joining A(/, J) and AU — 1, J), and 
its “right edge,” joining AU, J) and A(/, J — 1), since the other two 
edges will have already been considered in connection with earlier 
patches or with one of the leading edges. 

Figure 6a gives an example of a surface defined over a 5 X 38 grid 
of points. The 5 X 3 array of sample points on the surface gives rise 
to two rows of patches, four in the front row and four in the back row. 
Figure 6b shows the conjunction of the leading left edge and the leading 
front edge of this same surface. 

Figures 7a through 7h show successive patches being added to the 
drawing of the surface. With each patch that is being added, L, denotes 
its “back” edge, and L, denotes its ‘right’? edge. Dotted lines indicate 
hidden segments which are not drawn. 

As indicated earlier, the determination of the visibility of each new 
edge under consideration will involve two piecewise linear continuous 
functions. Intuitively, one of these functions, Max, will be used to 
define the path of the upper perimeter of the region of the projection 
plane thus far covered, and the other function, Min, will be used to 
define the path of the lower perimeter. At this point, it would be useful 
to identify the grid of the plotting device with a rectangle in the pro- 
jection plane which circumscribes the perspective image of the surface. 
(This corresponds to a scaling and translation of the coordinates of 
the image points so that they fall within the dimensions of the plotting 
grid.) Then, in order to optimize the precision in delineating the hidden 
region of the projection plane, or equivalently, the hidden region of 
the plotting grid, the number of breakpoints of each of the functions 
Max and Min is chosen to coincide with the number of horizontal 
units of the plotting grid. 

Before plotting begins, we artifically set Max (k) = Min (k) = —1, 


A(1,3) 


A(3,1) 





A(i,1) 
(a) (b) 


Fig. 6—Surface and its leading left and front edges. 
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A(2,3) 
Ly ic 
eA 
(a) (e) 
A(3,3) 
L 
(b) (fF) 





(d) 


Fig. 7—Successive patches of surface being drawn. 


for k = 1, 2, --- , K, , where K, is the horizontal dimension of the 
plotting grid. Then, after each line segment is drawn, Max or Min 
(possibly both) will be modified to reflect the change in the shape of 
the emerging image of S. We begin with the line segments on the leading 
edges of S. Let A and B be any two consecutive “‘sample points” of 
the leading left edge or of the leading front edge, that is, any two 
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consecutive points in the following sequence (see Fig. 2): 
A(l, N), A(1, NO 1), ae) A(1, 1), A(2, 1), aes | A(M, 1). 


Let A, , A, be the (scaled and translated) plotting coordinates of A, 
and B, , B, the plotting coordinates of B. First, the line segment joining 
the points (A, , A,) and (B, , B,) is drawn. Then both Max and Min 
are redefined between A, and B, by setting 


Max (A,) = Min (A,) = A, 
Max (B,) = Min (B,) = B, 


and linearly interpolating both Max and Min between A, and B, . 
(In Fig. 8, K, and K, refer to the dimensions of the plotting grid, in 
the horizontal and vertical directions, respectively.) 

After the above procedure is carried out for every pair of consecutive 
points in the sequence, the leading edges will be completely drawn, 
and their path will be defined by Max and Min between the two end- 
points. Note that the region bounded between Max and Min has area 
zero, reflecting the fact that no patch has yet been drawn. 

We now proceed to draw the surface patches. Each S(J, J) in the 
ordering, beginning with S(2, 2) will be uniformly processed as follows: 
Let L, denote the edge between A(J, J) and A(I — 1, J), Lz denote 
the edge between A(J, J) and A(J, J — 1). For each of these edges, 
we must first determine its visible segments, then draw only those 
segments, and modify Max or Min to reflect the change in the shape 
of the hidden region (the entire edge may be hidden, of course). 

First, let LD = L,and A = AU, J), B = AU — 1, J), the endpoints 
of L, . The following criterion is used to determine the visibility of 
any point C on L: If C, , C, are the plotting coordinates of C, then 





‘i Bn Bop Kp 


Fig. 8—Linear interpolation of Max and Min. 
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C' ts visible if and only if C, 2 Max (C,) or C, S Min (C,). This visibility 
criterion is applied to each of the endpoints A and B. (Let A, , A, 
be the plotting coordinates of A; B, , B, those of B.) There are four 
possibilities: 


(i) A, B are both visible. Then all of LZ is assumed to be visible. 
(This assumption is made to reduce computation time. It may happen 
that some segments of hidden lines will be incorrectly drawn if the 
surface has sharp “‘spikes’”’ in the foreground. This problem can be 
circumvented by starting with a finer grid on R.) The line segment 
joining (A, , A,) and (B, , B,) is drawn. If A, 2 Max (A,), Max is 
linearly interpolated between A, and B, by setting Max (A,) = A, , 
Max (B,) = B, . Otherwise, we must have A, < Min (4,), and Min is 
similarly modified between A, and B, , with Min (A,) = A,, Min (B,) = 
B,. 

(iz) A and B are both hidden. Then for the same reasons as in case 
(z), we assume all of Z is hidden and no line segment is drawn. Max 
and Min are left unchanged, and we proceed to the next edge. 

(721) A is hidden and B is visible. A search is made along the line 
joining A and B, starting at A, for the first visible point C (discrete 
steps are taken in the horizontal direction of the plotting grid). The 
segment joining C' to B is assumed to be entirely visible and is drawn. 
If C, 2 Max (C,), Max is linearly modified between C, and B, with 
Max (C,) = C,, Max (B,) = B,. Otherwise, Min is modified between 
C, and B,. 

(tv) A is visible and B is hidden. A search is made along the line 
joining A and B for the first visible point C’, starting from the hidden 
point B. The rest is analogous to case (777). 


For each surface patch S(J, J), this procedure is carried out with 
L = L, and B = A(I — 1, J); then the procedure is repeated with 
L = L, and B = A(I, J — 1). This completes the description of the 
algorithm for the special case in which V, , the vertical projection of 
the vantage point onto the z, y-plane, is southwest of the domain R. 

If V, is in any of the other ‘‘corner” regions of the z, y-plane (indicated 
by NW, NE, and SE in Fig. 9a), we may rotate the data defining the 
surface (either the mathematical function or the rectangular array 
of sample points) by 90, 180, or 270 degrees, respectively, and apply 
the algorithm as described for V, in the SW region of the z, y-plane. 
Another approach is to change the ordering of the surface patches and 
to define new leading edges to suit the other locations of Vo . 

If V, is in any of the regions directly west, north, east, or south of R 


854 THE BELL SYSTEM TECHNICAL JOURNAL, APRIL 1972 


NW NE 





(a) 


Fig. 9—Locations of vertical projection of vantage point. 


(Fig. 9b), we may partition R into two subrectangles R, and R, 
(Fig. 10a), and apply the algorithm to each of R, , R. , since Vo is in 
a “corner” location relative to each subrectangle. The two parts of 
the surface, corresponding to R, and R, , will have to be “pieced to- 
gether” along the line of partition, and this may be done without 
“flaws” by augmenting the sample points of the surface with the set 
of points defined at the intersection of the partition line with the grid 
lines on R. 

Finally, if Vo lies inside the domain R, we may partition R into four 
subrectangles (Fig. 10b) and apply the algorithm to each of R, , R2 , 
R; , R, , with Vo in a “corner”’ location. 


Remark: Note that under the assumption that the plane of projection 
P be perpendicular to the line joining V and the origin, if V> coincides 
with the origin, then the projection of the z-axis from V onto P is a 
point. In this case, the y’-axis on P must be chosen to be the image 
of some other line. See the Appendix. 


Figures 11, 12, and 13 are sample drawings generated on the Honey- 
well 6070 computer with the Stromberg—Carlson 4060 microfilm plotter. 
Each surface was evaluated at a 64 X 64 grid of points, for a total of 
4096 data points. The average run time was 9 seconds for each surface. 
(Part of the 9 seconds was used in computing the function values and 


Vie Sas 





(a) 


Fig. 10—Partitioning of domain for non-corner locations of Vo. 
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plotting coordinates at each point of the 64 X 64 grid. In fact, these 
computations were performed twice, once for determining the scale 
factors and again for the actual plotting of the surface.) Total storage 
for generating microfilm output of each surface was 16K. 

It is clear that using this algorithm, computation time should vary 
linearly with the number of data points, since each data point is pro- 
cessed by evaluating its plotting coordinates 1X, JY, and then com- 
paring ZY with just two values, Max (JX) and Min (JX). 

Economically generated movies have also been produced on both 
the Honeywell 6070 and the DDP 516 computers. These movies show 
surfaces in rotation and surfaces undergoing continuous change in 
shape. The average run time on the Honeywell 6070 for making such a 
movie with a 32 X 32 grid and viewed from 240 distinct vantage points 
was 7.4 minutes, or about 1.85 seconds per frame. The corresponding 
run time on the DDP 516 was 7.5 seconds per frame. (For these movies, 
the scale factors were precomputed so that function values and plotting 
coordinates were evaluated only once for each frame of the movie.) 

For a rough comparison of computation times for hidden-line elimina- 
tion drawings based upon other algorithms (and implemented on other 
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Fig. 11—f(z, y) = (sin (x) — 1) (sin (y) — 1), ~7w@ S2,y Sm. 
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Fig. 12—f(z, y) = sin (w + y)/(1 + (@ — y)*), -27 Sa,y S 2z. 


computers) see Refs. 1 through 7. In particular, see Refs. 2 and 3 for 
a comparison with two other algorithms which deal with single-valued 
functions defined over a rectangular domain. Using the algorithm 
described in Ref. 2, computation time increases quadratically with 
the number of data points, since the visibility of each data point is 
determined by an exhaustive comparison with every other data point. 
Although computation time increases linearly using the algorithm 
of Ref. 3, the storage required is large in order to obtain high-quality 
drawings. This is because the “plotting page’’ is subdivided into rec- 
tangles, and, of all possible line segments of the surface whose images 
lie in any given rectangle, only the one closest to the vantage point is 
drawn in that rectangle. Thus, in order to avoid a “sketchy” drawing, 
a fine subdivision of the “plotting page” is necessary and a large array 
is required to store the information in each rectangle of the subdivision. 
(The algorithm of Ref. 3, however, is applicable to a larger class of 
surfaces than the algorithm described in this paper. For example, it 
can be used to draw surfaces corresponding to functions which are 
not single-valued, e.g., intersecting cylinders.) 
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VI. GENERALIZATIONS 


6.1 Convex Domains 


The algorithm may be generalized to an arbitrary surface S defined 
by a single-valued continuous function over a convex domain R, by 
artificially extending the defining function f to a rectangular domain 
R’ containing R. The ideas of the algorithm are then applied to the 
surface S’ defined by the extended function f’ over R’, with the following 
modifications: First, the sample points along the leading edges of S’ 
must be replaced by a subset of points defined at the intersection of 
the boundary of R with the grid lines on R’, and special consideration 
must be given to drawing the lines connecting these points. Second, 
a test must be made of sample points of S’ to determine if they belong 
to S, so that only line segments that are part of S are drawn. One way 
to accomplish this is by setting f’(z, y) = 2 for all (z, y) in R’ — R, 
where 2 is a number outside the range of values of f on R. (Convexity 
of the domain is necessary in order to bound the emerging images at 
each stage of the drawing between two functions.) 
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Fig. 18—f(z, y) = cos (2? + y*)/e9?@*4v") -2.5 Sa2,y S 2.5. 
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6.2 Multi-Valued Functions 


The algorithm may also be extended to certain multi-valued functions. 
For example, if the surface is a sphere, it is first partitioned by a hori- 
zontal plane into two single-valued pieces, and if the vantage point 
is above the horizontal plane, the algorithm is first applied to the upper 
hemisphere, and then, without reinitializing Max or Min, the algorithm 
is applied to the lower hemisphere. 


6.3 Bivariate Histograms 


With minor modifications the algorithm may be adapted to the 
perspective drawing of bivariate histograms, with elimination of hidden 
lines. Figures 14a through 14d show four different perspectives of a 
bivariate histogram generated by the same data. The computation 
time on the Honeywell 6070 was 1.6 seconds for each of the four drawings. 





Fig. 14—Bivariate histograms: (a) Vo in SW corner. (b) Vo in SE corner (c) Vo. 
in NE corner. (d) Voin NW corner. 
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APPENDIX 


Evaluation of Rectangular Coordinates of Image Points 


Let S be a surface defined over a domain centered at the origin of 
the x, y-plane, and let V = (v, , v2 , v3) be a point not on S. In order 
to simplify the calculation of the coordinates of image points and hence 
reduce computation time, a special plane P*, normal to V and passing 
through the origin, will be used as the plane of projection. (In Fig. 1, 
a plane P parallel to P* is shown. Note that the image of S on P differs 
from the image of S on P* only by a scale factor.) 

Let A = (a, , a , a3) be any point on S, and let B = (0, , bo , bs) 
denote its image point on P*, The plotting coordinates of A depend on 
the two-dimensional coordinates (u, v) of B with respect to an orthogonal 
system on P*. If X = (a, , v2 , 2) and Y = (y, , yo , ys) are the positive 
unit vectors of the coordinate system on P*, then the scalars uw and v 
must satisfy the following relationship: 


uX + 0Y = B. (1) 
Since B is collinear with the points A and V, then 
B=V++r(A —Y), for some scalar r. (2) 


Thus, eq. (1) may be replaced by 
uX +vY =V+rd —orV. (3) 
Taking the dot product of both sides of eq. (3) with X, we have 
uX-X +0Y-X =V-X + 7rA-X —1V-X. 
Since X-X = land Y-X = V-X = 0 (V-X = 0 because X on P* 
and P* normal to V), we have 
u=rA-X. (4) 
Similarly, taking the dot product of both sides of eq. (3) with Y, we 
have 
v=rA-Y. (5) 
The scalar r is determined by taking the dot product of both sides of 
eq. (2) with V. Since B lies on P*, then B-V = 0, and we have 
r=V-V/V-(V — A). (6) 
It remains to determine the coordinates of the unit vectors X, Y. 
As mentioned in Section II, the y’-axis on P* must be the perspective 
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image of the original z-axis. Note, however, that if V is on the z-axis, 
then the perspective image from V of the z-axis is a single point, which 
of course cannot be used as an axis line. In this case, we must com- 
promise by either moving the vantage point slightly off the z-axis 
or choosing the y’-axis on P* to be the image of a slightly ‘askew’ 
z-axis. For the rest of this discussion we assume that V is not on the 
2-axis, 

Let Z = (0, 0, 1). Then V, Y, Z lie on the same plane, because Y 
is the image from V of some scalar multiple of Z. Thus Z is a linear 
combination of V and Y. Since X-V = Oand X-Y = 0, then X-Z = 0. 
The normalization of any vector X’ satisfying the equations X’-V = 0 
and X’-Z = 0 will be a unit vector on the 2’-axis of P*. X’ = 
( — v2, v,, 0) satisfies these two equations. Let d, = Vv + v2. Then 


X = (—v2/dy ,01/d, , 0) 


is the positive unit vector on the z’-axis of P*. The coordinates of Y 
are similarly determined from the equations Y-X = 0 and Y-V = 0. 


Let d = Voi +; +03. Then 
Y= (v3 , —V203 , d;)/dd, 


is the positive unit vector on the y’-axis of P*, Substituting these values 
of X and Y back into eqs. (4) and (5), we have 


UU = (a0; = V2) /dy (7) 
Vv = 7r(—A,v1vV3 — AgVgv3 + asd;)/(dd,), (8) 


where r = d’/(d’ — (ayv, + a2v2 + av3)), as defined in eq. (6). 
The above expression for v is algebraically equivalent to the following 
one, which involves half as many arithmetic operations: 


v = (v3 + r(a3 — 03))-d/dy. (9) 


Thus, given a point A on S, the rectangular coordinates (u, v) of 
its image point are evaluated according to eqs. (7) and (9). 
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Some Effects of Quantization and Adder 
Overflow on the Forced Response 


of Digital Filters 


By A. N. WILLSON, JR. 
(Manuscript received December 28, 1971) 


The effects of quantization (2.e., roundoff, truncation, etc.) and adder 
overflow, which are present in any special-purpose computer type realiza- 
tion of a digital filter, cause an otherwise linear system to become quite 
nonlinear. Moreover, the presence of such nonlinearities can cause the 
system’s response to differ drastically from the ideal response (that is, 
from the response of the linear model of the filter) even when the level of the 
jilter’s input signal is, in a certain reasonable sense, small, and when the 
quantization effects are made arbitrarily small. 

In this paper we derive a criterion for the satisfactory behavior of second- 
order digital filters in the presence of such nonlinear effects. The criterion 
ts shown to be sharp, in that we also present a procedure for constructing 
counterexamples which show that, for most filters which violate the criterion, 
the response to some ‘‘small’’ nonzero input signal is not always even 
asymptotically close to the ideal response. 


I. INTRODUCTION 


The effects of quantization (i.e., roundoff, truncation, etc.) and adder 
overflow are present in any special-purpose computer type realization 
of a digital filter. When taken into account, these effects cause an 
otherwise linear system to become quite nonlinear. To date, the analysis 
of limit cycle phenomena in such nonlinear digital filters has been 
concerned with the study of the zero-input response of second-order 
filters."~-* A more fundamental problem is that of determining whether 
or not a filter’s response to a nonzero input (the forced response) is in 
some meaningful sense close to the ideal response. This problem seems 
to have been ignored. 

If we consider input sequences, the levels of which are sufficiently 
small (in the sense that when the input sequence is applied to the linear 


863 


864 THE BELL SYSTEM TECHNICAL JOURNAL, APRIL 1972 


model of the filter, the response eventually lies within the open interval 
determined by the most positive and the most negative machine 
numbers), then it is tempting to conjecture, as if the system were 
linear, that when the filter’s zero-input response can be made to admit 
only limit cycles of small amplitude by using sufficiently many bits in 
the representation of the data so that the quantization errors are made 
sufficiently small, then the deviation of the filter’s forced response 
from the ideal can also be made small in the same manner. As will be 
shown by counterexamples, however, this conjecture is false. Thus, 
since the usual purpose of a digital filter is the processing of nonzero 
signals, a question of major importance becomes: How can it be deter- 
mined that, in the presence of quantization and adder overflow, a 
digital filter’s forced response will be satisfactory? 

In this paper we analyze the forced response of second-order digital 
filters which employ a type of arithmetic that has been called saturation 
arithmetic.t The essential structure of a second-order digital filter is 
shown in Fig. 1 where, for given real numbers a, b the filter’s output 
sequence? v, k = 1, 2, --- , is uniquely determined by the input 
sequence u, k = 1, 2, --- , and by v”, v, the initial values of the 
filter’s state variables. We develop a criterion by which satisfactory 
behavior of the filter can be determined. The criterion is shown to be 
sharp, in the sense that our counterexamples show that for most filters 
which violate the criterion, the forced response is not always close to 
the ideal response. 

More precisely, we show that when the filter’s coefficients a, b are 
determined by any point lying within the open crosshatched region 
of Fig. 2, and for any input sequence whose level is small (in the sense 
mentioned earlier), then the response of the nonlinear filter will be 
asymptotically close to the ideal response. On the other hand, we show 
that when the filter’s coefficients are determined by any point lying 
within the shaded regions in the lower corners of the triangle of Fig. 2, 
and when certain very reasonable assumptions are satisfied concerning 
the nature of the quantization, then there exist input sequences the 
levels of which are also small, but for which the filter’s response is not 
asymptotically close to the ideal response. 


+ The definition of this term is given in Section II. 

+In many applications some linear combination of the quantities v™, v1), 
v&~) is taken to be the filter’s output at the Ath time instant. This additional com- 
plication has no bearing on the matters considered here. For simplicity, therefore, 
we consider the sequence v“ to be the filter’s output. 
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ufk) y(k) ylk-1) y(k-2) 





Fig. 1—Second-order digital filter. 


II. SECOND-ORDER FILTERS 


The usual method of designing digital filters* employs the inter- 
connection of many second-order filters. The analysis and design of 
second-order digital filters is therefore a problem of considerable 
practical importance. 

The behavior of the digital filter of Fig. 1 is characterized by the 
linear difference equation 


0 
w? = Aw? + 





| k=0,1,2,:--, (la) 


k+1) 
u' 





Fig. 2—Region determining filter coefficients for which the filter’s forced response 
can be made asymptotically close to the ideal response (crosshatched region), and 
region determining coefficients for which the forced response will not always be 
close to the ideal response (shaded region). 
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where A denotes the 2 X 2 matrix 


ie? ) (1b) 
ba 


and w“? is a two-dimensional vector (specifying the state of the system 
at the kth time instant) the second component of which, wi", corre- 
sponds to the digital filter’s output sequence v’. 

In any special-purpose computer type realization of the digital filter 
of Fig. 1 the ideal behavior specified by (1) can be only approximated. 
At each time instant, the output of the summation point can assume 
only one of a finite number of values. Therefore, the actual value of 


the summation point’s output is given by an expression such as 
i f(av*-? + by ot u) + e® 


where the function f accounts for adder overflow and the sequence e“”’ 
accounts for the quantization error that is inherently present. The 
equality f(é) = & is satisfied only in a certain neighborhood of the origin 
which we take to be the interval —1 S & S 1. We consider filters 
employing saturation arithmetic; that is, we define f() = —1foré < —1 
and f(é) = 1 foré > 1. 

When the effects of quantization and adder overflow are taken into 
consideration, the digital filter of Fig. 1 is then characterized by the 


nonlinear difference equation 
0 
)+| |. ' OF 1525 5 (2) 
ett) 


ren. pf Ar 4+ | © 
(k+1) 
where the state of the system at the kth time instant is now specified 
by the two-dimensional vector 7“. The mapping F is defined by the 


relation 
(y) 


Since the purpose of our study is to examine the effects of quantization 
and adder overflow on the forced response of digital filters, we are 
interested in comparing the solutions of (1) and (2) when the equations 
are given identical input sequences and identical initial conditions. We 
make the reasonable assumption that we are concerned only with digital 
filters whose linear model, i.e., eq. (1), is asymptotically stable. It is 
well known that (1) characterizes an asymptotically stable linear 
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system if and only if each eigenvalue of the matrix A has magnitude 
less than unity. The eigenvalues of A (the roots of the polynomial 
\” — ad — b) are known to have magnitude less than unity if and only 
if the coefficients a, b have values determined by points that lie within 
the large open triangular region shown in Fig. 2 (determined by the 
straight lines:b + a = 1,b = —1). 

It is clear that so long as the filter’s input sequence is such that the 
solution of the linear equation (1) is continually being driven into the 
region || w” || > 1, then there is little point in trying to compare 
the solutions of (1) and (2); it being clear at the start that at each such 
time instant, they will differ by at least the amount by which || w? || 
exceeds unity (plus or minus the quantization error e“’ which, pre- 
sumably, will be small). At the other extreme, if it is known in advance 
that the initial conditions and the input sequence are (small enough 
in magnitude) such that the solution of the linear equation (1) is within 
the range || w || < 1 — 6, for some 6 > 0, and for all k = 1,2, --- , 
then there is no problem. That is, it is clear at the outset (due to the 
assumption that the linear system is asymptotically stable) that the 
solutions of (1) and (2) will be made arbitrarily close for all such inputs, 
by simply causing the magnitude of the quantization error e” to be 
bounded by a sufficiently small number. In effect, the nonlinear function 
f is then not present; we are simply comparing the responses of the 
same stable linear system to two slightly different inputs. 

The interesting question which we shall consider is the one which 
follows. Suppase we assume only that the filter’s input is such that the 
ideal response, the solution of the linear system (1), eventually (1.e., for 
all k sufficiently large) satisfies || w“’ || < 1 — 6, for some 6 > 0. Then, 
when is the same thing (ie., || 7 || < 1 — 6 for some 5 > 0, and all k 
sufficiently large) true for the solution of eq. (2)? Thus, we are interested 
in knowing when the gross effects of the nonlinearity are simply of a 
transient nature and hence, aside from such transient effects, when can 
the filter’s response be made as close to the ideal as desired by simply 
causing the quantization error to be sufficiently small (.e., by using 
a sufficient number of bits in the representation of the data). Unfor- 
tunately, as our counterexamples will show, it is not always the case 

t For each w = (w1, we)? we define ||w|| = max {|w,|, |wel}. 

t The inequality ||w®|| < 1 might seem more reasonable here. The necessity to 
write 1 — 6on the right-hand side is the small price that we must pay for the freedom 
to treat the quantization error in the relatively simple manner that we have chosen. 
By considering the quantization error at each step to be simply a “‘small’’ input 
e™, we do not admit to the knowledge that, for example, in all sufficiently small 


neighborhoods of the points £ = +1, the quantization (be it roundoff, truncation, 
or whatever) will be done in such a manner that |& + e™| S 1. 
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that this will occur in the nonlinear system whenever the linear system’s 
response satisfies || w“” || < 1 — 6 for some 6 > O and all sufficiently 
large k. 

With our objective thus being to compare the asymptotic behavior 
of the solutions of (1) and (2), and since the linear system (1) is assumed 
to be asymptotically stable, it is clear that we may drop the requirement 
that the equations have the same initial conditions. This follows, 
of course, from the fact that the initial conditions of (1) do not affect 
the solution’s asymptotic behavior. 

By including the quantization effects in the linear model of the filter, 
the system is then described by the equation 


sar ms + | leaps k=0,1,2,-:-, (4 


k+1 k+1 
u' ) e! ) 


whose solution can be made arbitrarily close to the solution of (1) by 
simply requiring that all |e“ | be sufficiently small. Let us assume, 
therefore, that the |e“ | are at least small enough that there exists 
6’ > 0 and a nonnegative integer K such that, for all nonnegative 
integers k = K, 


eae |e ol ale a (5) 

Letting 
=r? — 6, = 0,1,2,---, 6) 
we find, from (2) and (4), that the sequence 2“ is determined by the 


equation 
nl Cal 0 |)- | 7 |. k=0,1,2,---, (7) 
(k+1) yr 





yp 


where 2 = r — 5 and, fork = 0, 1,2, --° ,v@*? = sft) — et 
which, according to (5), with e = 1 — 6’, satisfies 
lagers | see fork 2 K. (8) 


We take as our objective, therefore: To determine when, for any 
sequence y“*”, k = 0,1, 2, --+ , satisfying (8) for some ¢ in the interval 
0 S « < 1, and some nonnegative integer K, the solution of (7) satisfies 
lim; || 2” || = 0. 

We note at this point that our objective stated in the preceding 
paragraph is similar to the objective in Ref. 3 (see the paragraph 
immediately following eq. (8) of that paper) where the control of 
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limit cycles in the zero-input response of second-order digital filters is 
considered. The important difference between the two objectives is 
that here we must accommodate any value of « in the interval [0, 1). 
In Ref. 8, however, it was only necessary to consider bounds on the 
sequence v“*” that were ‘sufficiently small’. The consequences of 
this difference are great. It will be clear that a much more delicate 
analysis is required here than that in Ref. 3. 


III. ANALYSIS OF THE FORCED RESPONSE 


We now determine, in accordance with the objective explained in 
Section IJ, a criterion for the satisfactory behavior of the forced response 
of second-order digital filters in the presence of quantization and adder 
overflow. We consider filters employing saturation arithmetic; that is, 
we define the function f of Section II by 


ie for —€< —-1 
f@=) & for -l1sSé<1 (9) 
1 for —>1. 
The following theorem is fundamental to our analysis. 


Theorem 1: Let the matrix A be defined by (1b) in which the values of 
a, b are specified by some point lying within the open triangular region 
of Fig. 2 (determined by the straight lines: b + a = 1, b = —1). Let the 
mapping F be defined by (8) in which the function f is specified by (9). 
Then, for any sequence v“*”, k = 0, 1, 2, «++ , satésfying (8) for some e 
in the interval O S « < 1 and some nonnegative integer K, the solution of 
(7) satisfies lim, || 2“ || = 0 provided that there exists a real number o 
such that 


1— oa’ >0, (10) 
[1 — 6° — (1 — o)a’}’ — a'[o + (2 — o)b)’ > 0, (11) 

and 
> |b? — (1 — o)a’|. (12) 


The proof of Theorem 1 is given in the Appendix. We now seek to 
determine those points lying within the triangular region of Fig. 2 
which specify values of a, 6 such that the inequalities (10), (11), (12) 
are satisfied for some value of c. 
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We begin by examining the case in which o = 0. In this case (10) is 
satisfied for all a, b and, as shown in the Appendix, (11) is satisfied 
for only those values of a, 6 specified by points lying within the open 
crosshatched region of Fig. 5 which, for « = 0, is the open crosshatched 
region of Fig. 3. By squaring each side of the inequality (12), it is 
easily shown that that inequality, with ¢ = 0, is equivalent to 


1—-@—b)+V(1 — a? — b’)? — 407’ > 0. 


Since values of a, b specified by points lying within the crosshatched 
region of Fig. 3 satisfy 1 — a’ — b” > 0, it is clear that all such values 
(and only those values) of a, b satisfy (10), (11), and (12) for o = 0. 

For negative values of o and for c = 2 it is clear that the crosshatched 
region of Fig. 5 lies interior to the crosshatched region of Fig. 3. Thus, 
consideration of such values of o can determine no values of a, b that are 
not already determined in Fig. 3 by consideration of the o = 0 case. 

We now show that values of o in the interval 1 S o < 2 yield no 
values of a, b satisfying (10), (11), and (12) that cannot also be deter- 
mined by considering some value of o in the interval 0 < o S 1. Let é 
satisfy 1 S$ 6 < 2 and then define ¢ = 2 — 6. Clearly 0 < ¢ S$ 1. 
Now, if (10) is satisfied for o = ¢, then, clearly, (10) is also satisfied 
for o = &. The expression on the left-hand side of (11) can be rewritten 
as (1 — 0b’)? + a’{[a? — 2(1 + b”)] — [a — (1 — 6)"Jo(2 — o)}. The 
form of this expression shows that it has the same value for ¢ = é and 
o = Go. Finally, it is clear that if (12) is satisfied for ¢ = 6, then (12) is 
also satisfied for ¢ = é since [using our observations regarding (10) 
and (11)] the left-hand side of that inequality is not decreased by 





Fig. 3—Region in which inequality (11) is satisfied for « = 0. 
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replacing é with ¢, and since 
|b? — (1 — é)a* | = [6° + (1 — a’ | 2 |b? — A — @)a’ |. 


There remains to consider only those values of o in the interval 
0 < o < 1. Thus, for each such value of « we wish to determine the 
values of the parameters a, b specified by points lying within the open 
crosshatched region of Fig. 5 and, from (10), within the open region 
specified by |a| < 1/c, for which the inequality (12) is satisfied. It is 
not difficult to show that for each value of o satisfying 0 < o S 1 the 
function 


g(a, b) 


HaV1— ce +V il — 8 — (1 — oa)? — ao + (2 — 0b]? 
— |b — (1 — aja’, 


whose domain is that portion of the open crosshatched region of Fig. 5 
where | a| < 1/c, is monotone decreasing in | a| for each value of b 
in the interval —1 < b S 0. Thus, the region in which the inequality 
(12) is satisfied is easily located by determining the curves ¢,(a, b) = 0. 
Moreover, because of the above observation concerning the monotonicity 
of ¢, , it is easy to determine these curves numerically. Several such 
curves, for various values of o, are shown in Fig. 4. 





Fig. 4—Location of the ¢,(a, b) = 0 curves for several values of c. 
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The region in which the inequalities (10), (11), (12) are satisfied for 
some real number, gc, is the union of the regions determined by the 
g(a, b) = 0 curves for 0 S$ o S 1. The numerical results show that 
this region has the shape indicated by the crosshatched area in Fig. 2. 
The boundary of this region appears to be determined by the straight 
lines b + a = 1 for b = —3, and by the ellipse a’ + 8b(1 + b) = 0 
forb S —34. 

It is clear that there are several ways in which our analysis could be 
refined in order to provide the possibility of improving upon the result 
of Theorem 1. In the next section, however, we show a fundamental 
limitation on the extent of any such improvement. We show there, how 
to construct counterexamples which demonstrate that for a certain 
large portion of the uncrosshatched area of the triangular region of 
Fig. 2 (in particular, the shaded areas in each lower corner) the con- 
clusion of Theorem 1 is, in fact, false. 


IV. COUNTEREXAMPLES 


We now show how to construct the counterexamples which have 
been referred to in the preceding sections. We begin by showing that, 
when the function f is defined by (9), and when the values of the filter’s 
coefficients are determined by any point lying within the open shaded 
regions in each lower corner of the triangle shown in Fig. 2, then there 
exist nonzero initial conditions and, for some e < 1, a periodic input 
sequence v“*” satisfying (8) such that the solution of (7) is periodic 
(and thus does not satisfy lim,.... || 2“ || = 0). 

We first consider values of a, b determined by any point lying within 
the shaded open region in the lower left corner of the triangle of Fig. 2. 
In particular, we assume that 


ab > 1. (13) 
It is also clear that the inequality 
a < —bd’ (14) 


holds for any such point. Denoting the initial condition 2“ by 2 = 
(x,y)? = (p, qg)”, and considering an input sequence having period 
three, specified by v\” = 0,» = 1— pp,» = — 1 — gq, it is clear 
from Table I that (7) has a nonzero periodic solution provided that 
values of p, g can be found such that the inequalities specified in paren- 
theses in Table I are satisfied. 


The inequalities in the »“*” column [which must hold in order that 
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the input sequence satisfy (8) for some « < 1] and the inequalities on 
the first line of the column labeled “bz + ay“ + v“*»” are equiv- 
alent to: 


0O<p< 2, 
—2<q< 0, 
—1 < bp+aq <1. (15) 


Thus, so long as we consider only positive values of p and negative 
values of q, these inequalities will always hold whenever p and q have 
sufficiently small magnitude. The remaining two inequalities specified 
in Table I will be satisfied provided that 


(1 — ab)p — (a + b)qg < 0 
and 
(a + b’)p — (1 — ab)qg < 0. (16) 


In view of the inequalities (13), (14), it is clear that there exist values 
of p > 0 and q < O such that (16) is satisfied. Moreover, it is clear 
that the magnitudes of p and q can be scaled such that the inequalities 
(15) are also satisfied. Thus, there exists a nonzero periodic (of period 
three) solution of (7). 

For values of a, b determined by any point lying within the open 
shaded region in the lower right corner of the triangle of Fig. 2 a similar 
line of reasoning shows that a nonzero solution of period six can be 
obtained. The existence of such a solution is easy to demonstrate by 
noting the odd-symmetry of the function f and, with the initial condition 
z = (p, q)", showing that, withy” =0,v? =1+ p,7»° =—1+4+4, 
there exist values of p, q such that 2 = (—p, —q)*. We omit the 
details. 

The above procedure for constructing counterexamples is concerned 
explicitly with the solutions of (7). The simple relationships between (7) 
and the original equations of interest, i.e., eqs. (1) and (2), are described 


TasBLE I-—COoNSTRUCTION OF A PERIODIC SOLUTION FOR EquaTION (7) 











k th) y) ylkth) br® + ay) 4 yep f — yet 
ol ep | a | 0 | (1K) bp Fag (<1) | bp Faq 
I q bp + aq) (=1<) 1—p (<1) | 69 + ebp +a +1 =p (>1) Pp 
2 | bp + ag p (-1<) -1 ~—q (<1) | b?p + abg + ap — 1 — gq (<-1) q 
3 P q 0 
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in Section IJ. It is instructive, however, to consider explicitly the 
implications of such counterexamples concerning the solutions of (1) 
and (2). 

Let any values of the parameters a, b, determined by some point 
lying within the open shaded regions in the lower corners of the triangle 
of Fig. 2, be given. Consider any counterexample constructed according 
to the above procedure. Then, assuming that the quantization occurs 
in an appropriate manner (or, assuming that there is no quantization) 
it is a straightforward matter to use the relationships between the 
variables of (1), (2), and (7) to demonstrate a periodic input sequence 
u™ and appropriate values of the initial conditions v°~”, v such that 
the response of the linear model of the filter of Fig. 1 [i.e., w’, the 
solution of (1)] is asymptotic to a periodic sequence, and satisfies 
|| w || < 1 for all sufficiently large k, while the response of the non- 
linear filter [i.e., 7“, the solution of (2)], although also periodic, is such 
that || wv — r™ || does not approach zero as k > ©. 

These counterexamples, while clearly demonstrating that there exists 
potential trouble whenever a filter’s coefficients are assigned such ‘“‘bad” 
values, do not show that such behavior will necessarily be possible 
for some particular filter. They do not demonstrate, for example, that 
with a particular (specified) kind of quantization, and with a particular 
set of permissible values for the filter’s input sequence, there will 
necessarily exist a periodic input sequence for which the linear, and 
the nonlinear digital filters have asymptotically different responses. It 
is possible, however, by considering at the outset the details of the 
quantization and thereby imposing somewhat different constraints 
(to those of Table I) on the values chosen for p, q, v°”, to construct 
certain counterexamples which show just that. 

We assume that the values specified for the parameters a, b are 
determined by a point lying within the open shaded region in the lower 
lefé corner of the triangle of Fig. 2. (A similar development could, of 
course, be considered for the other shaded region.) We also assume 
that a certain finite set Q of allowable machine numbers, satisfying 
xe Q= |x| S 1, is specified. Thus, we assume that for the nonlinear 
digital filter with quantization the variables u®, v®, v®~, vo“? of 
Fig. 1 can assume only those values specified by the set Q. Furthermore, 
we assume that the filter employs saturation arithmetic with the over- 
flow and quantization effects both specified by a certain function f, ; 
that is, given any values for u“’, v“~”, v“~” taken from the set Q, 
the value for v appearing at the output of the summation point in 
Fig. 1 is specified by 
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yp? a fa(av*-? + by *-»? ai u”), (17) 
If, for example, with a = —1.3 and b = —0.9, the values u™ = 0.0, 
y*-Y) = —0.9, vw” = 1.0 are considered; and if the quantization is 


accomplished by simply rounding the ideal output of the summation 
point to the nearest tenth, the value v specified by (17) isv™ = 0.3. 
Clearly, if the set Q imposes sufficiently severe (indeed, for practical 
purposes, unreasonable) restrictions on the values that the input sequence 
and the initial conditions may assume, then it will be impossible to 
construct a counterexample. It is no surprise, therefore, that the success 
of the process to be described depends upon the assumption that the 
quantization is ‘sufficiently fine’ (that is, that there are sufficient 
quantization levels distributed throughout the interval [—1, 1]), and 
that when | av“~” + bv“? + u™ | < 1, the actual output of the 
summation point is reasonably close to the ideal value, that is, 


f.(av*-” + by *-” + u) My au? + byt” = a”. (18) 


We first note that the values of a, b determined by any point lying 
within the open shaded region in the lower left corner of the triangle of 
Fig. 2 are such that —1 < a — b < 1. Thus, since b/a > 0, we also have 
—1<a-—b+ )/aanda— b < 1. This ensures that the open intervals 
(—1, 1) and (a — b,a — b + b/a) overlap. Hence, if the quantization is 
sufficiently fine, there exists u"” e Q such that 


—a/fb<0<b-—atu™ < db/a <1. (19) 
Thus, for such a value of u‘”, 
b—a(b—atu™”) <0 
and 
a+ bibb—atu”) <0. 
Henee, for sufficiently fine quantization, there exist u™, u® e Q such that 
1+b-—ab—atu”) —u® <0, (20) 
and 
ltat+bb—atu”)+u% <0. (21) 


We let rx denote the most positive value in the set Q and let rnin 
denote the most negative value in Q. We also let 


fa(@?min 1 ae + u?), 


(2) 
T2 = Tmax) 


(3) 
To = lmin- 
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Now, assuming that (18) holds, it can be expected, due to (20) and 
(21), that there exist p, q such that 


(1 — ab)p — (a? + bg = 1 — Drain — arf? — u® <0, (22) 
(a + b*)p — (1 — ab)g = 14+ atin + Orf? +u® <0. (23) 


Moreover, if the quantization is sufficiently fine, the values for wu” and 
u® can be chosen such that 1 — brain — arf? — u® 0 andl + 
Omax + br§? + u™ = 0, and such that the values of these expressions 
are in the proper ratio that, in fact, small values of p > O and gq < 0 
are determined by the equations in (22), (23). Thus, since for sufficiently 


fine quantization 


Armin + Ofmx UY” Sb-—atu”, (24) 
and, due to (19), it is reasonable to expect that there exists v‘” such that 
—1 < bp tag +e = afgin + Ofmx tu” <1. (25) 


Furthermore, for p > 0, gq < 0 small, we expect that the following 
inequalities also hold: 


—1 <rj” — bp — aq <1, (26) 
a1 < Tmax — D < 1, (27) 
Ay ee ee eng ST, (28) 


Assuming therefore that the values of u, u, u®, p, q, v are 


such that (22), (23), (25), (26), (27), and (28) hold, we proceed with 
the construction of a counterexample by simply assigning the values to 
the remaining variables that are dictated by the relationships specified 
in Section II. In particular, we let 


(2) 
’ 


3) = 1 bp — aq, oe) = cS yr 
8. = Tmax — PD; e =-l1+ Tmax) 
gy? = Yimin —_ q; e® -_ 1 + tmin) 

Mele P; 
ee oot i, q; 
and 
T= %, St =, 
ri?) = ees si? _ a” 


(3) (2) (3) (2) 
Ty =f7o. 5 S§; = 8) . 
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At this point, one final step remains in our construction of a counter- 
example. We have obtained periodic solutions of eqs. (4) and (2), with 
the solution of (4) satisfying || s“ || < 1. We would like to obtain the 
corresponding periodic sequence to which the solution of (1) is asymp- 
totic. This sequence, which we shall call 2%“, is easily determined by 
the equations 


ws 1 -—b -al "fu 
os} =|-a 1 —bl lu], 
3” —Pee@: ay ola 
MW? = wf, 

Oi” = Hf, 

es = 


We have now found a true counterexample only if the values of 0” 
are also such that || #“ || < 1. It is reasonable to expect that this 
inequality will hold, however, since || “ — s || is known to be 
small provided that the values of e“”, e‘, e® are small, and these 
terms will be small whenever the quantization is sufficiently fine [note 
that e? = r — bp — ag — v, and recall the equality expressed 
in (25)]. 

Computer programs have been written which use the above process 
for constructing counterexamples, and which simulate the behavior of 
linear and nonlinear digital filters. It has been our experience, based 
upon experimentation with these programs, that counterexamples of 
the type described above can easily be found for values of the coeffi- 
cients a, b determined by points lying within the shaded region in the 
lower left-hand corner of the triangle in Fig. 2 even when the quantiza- 
tion is extremely coarse, much coarser than the quantization occurring 
in current practical digital filter realizations. We give, for example, 
the following numerical counterexample, constructed according to the 
above procedure, in which we have intentionally considered very coarse 
quantization, and have also made the task even more difficult by 
choosing u"’, wu, wu in, obviously, a somewhat less-than-optimum 
manner, with the result that |p| and | q| are larger than necessary. 
This does, however, cause the resulting sequences r“’, w“” to be quite 
different. 

We assume that the coefficient values a = —1.8, b = —0.9 have 
been specified. We also assume that 


) 
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Q = {—0.9, —0.8, --- , 0.9, 1}. 


We assume that the quantization is performed by simple rounding, at 
the output of the summation point, of the ideal sum to the nearest 
tenth. We then have ra. = 1, min = —O0.9, and therefore, choosing 


uw? =0, uw” =06, u® = 02, 


it follows that 


rm =03, rf =1, rf = -09. (29) 
We find that the approximate values of p, q, specified by (22), (23) are: 
p = 0.711, gq = —0.128. Following the above outlined procedure, we 


find that all of the required relationships hold. The resulting periodic 
» is asymptotic is specified by 


A 


sequence #™ to which the sequence w‘ 
the following approximate values: 


0” = 0.905, #6 = 0.135, 20° = —0.790. (30) 
Note that quite different solutions are specified by (29) and (30). 


Vv. THE FORCED RESPONSE AND INPUT SCALING 


We have shown in Section IV that the forced response of a stable 
second-order digital filter employing saturation arithmetic might not, 
for some inputs, be even asymptotically close to the filter’s ideal response 
(the response of the linear filter) if the coefficients a, b are specified 
by a point lying outside the crosshatched region of the triangle in 
Fig. 2. More precisely, we have shown that this certainly happens for 
coefficient values determined by points lying within the shaded regions 
in each lower corner of that triangle (so long as certain reasonable 
assumptions hold concerning the nature of the quantization). Thus one 
concludes that, when designing a filter, it is desirable to avoid choosing 
such coefficient values. In practical applications, however, it might be 
the case that due to other considerations such a choice cannot be 
avoided. Then it is clear that the designer must be careful to impose 
appropriate restrictions on the filter’s input sequence and on its initial 
conditions. He might, for example, scale the input sequence such that 
it is always small enough. The question thus arises: How small is 
“small enough”? One obvious answer to this question is that the input 
and the initial conditions be required to be small enough that the re- 
sponse of the linear filter [described by (1)] satisfies, for some 6 > 0 and 
all k = 0, 1, 2, --+ , the inequality || w™ || < 1 — 6. Then, by using 
sufficiently many bits in the representation of the data, the quantization 
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error can always be made sufficiently small that the adder overflow 
nonlinearity is not encountered. 

The results contained in a paper’ on limit cycles can provide another 
answer to this question. This answer requires consideration of only the 
asymptotic nature of the input sequence, and applies to filters using a 
variety of kinds of arithmetic including, in particular, saturation 
arithmetic. It is clear from the analysis presented in Ref. 3, that it is 
sufficient that the input sequence u“*” and the quantization error 
sequence e“*” be such that the solution of (4) satisfy, for some non- 
negative integer K, the inequality 


| sg” {| a. ler” | = é, fork > K, 


where 6 is one of the bounds specified in Theorem 1 of Ref. 3 for the 
sequence y“*”, In the case of saturation arithmetic we have 


6 = max (pet 1 — jh, 


2+ jal’ 1+ |b] 
Then, it is clear (by Theorem 1 of Ref. 3) that the solution of (7) 
satisfies lim,... || 2“ || = 0, which therefore ensures proper asymptotic 


behavior of the forced response of the nonlinear filter. 


VI. ACKNOWLEDGMENT 


The author is pleased to acknowledge the helpful comments of his 
colleagues J. F. Kaiser and I. W. Sandberg concerning this work. 


APPENDIX 


The proof of Theorem 1 follows. The proof uses the following well- 
known result concerning the application of Liapunov’s “‘second method” 
to the study of the stability of difference equations.°~’ 


Lemma 1: Let G denote a subset of the n-dimensional Euclidean space 
E” containing the origin 6. If there exist continuous functions W:G — E’, 
V:G — E’, and if there exists a nonnegative integer K such that: 


(4) W(z) > O for allzeG,2z # 6, 
(2) W(8) = 0, 
(iz) V(z) 2 0 for allzeG, 
(iv) AV(k, 2) = V(g(k, z2)) — Viz) S —Wee) for all k = K and all 
z22G, 


then each solution of the difference equation z2“*? = g(k, 2) which 
remains in G for allk = K approaches the originask > o. 
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For any particular application, the effectiveness of Liapunov’s method 
is of course highly dependent upon the appropriateness of the particular 
Liapunov function V that is chosen. The quadratic form that will now 
be described is quite useful for our purposes. 

For any given values of the parameters a, b which specify an asymp- 
totically stable linear digital filter, and with the eigenvalues of the 
matrix A of (ib) being denoted by 2, , Az, let the Liapunov function V 
be defined by 


V(z) = 2’Bz, (31) 
with 
2 2 =; 
Rea tA + [dal + 2u "] (32) 
—oa 2 


where the values of o and yp are yet to be determined. 

In the following lemma we determine, for any given value of oc, those 
values of » for which the matrices B and B — A’BA are positive 
definite. 


Lemma 2: Let o be a given real number. Then, necessary and sufficient 
conditions for the matrices B and B — A’ BA both to be positive definite 
for values of a, b which specify an asymptotically stable linear digital filter 
are: that the values of the parameters a, b be restricted to those values 
specified by points lying within the open crosshatched region of Fig. 5, 
and that, with ny < pe specified by 


Mao = B{1 +B? — (1 — oda? — (Ia? + [rel?) 
+£V[1 — 8 — (1 = o)a*} — as + (2 — o)b)"}, (88) 
a value be assigned to p such that 
by Sp < pe. 


Proof: It is clear that a necessary and sufficient condition for the 
matrix B of (32) to be positive definite is that det B > 0, which is 
equivalent to the inequality 


w> —3M + | As |’) + 40°". (34) 
The matrix B — A*BA, which has the form 
haa + [da[?) = 20? + 2y — als + 2 — )b] 
— alo +(2— ob) 2— (P+ Daf) — 20 — oa? — 2u | 
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-1/| iro], 
\ 





a¥= 2/(I1-o|+1) 
b*=(|1-c|-1) /(I1-ol +1) 


Fig. 5—Region of the a-b plane in which the matrices B and B — ATBA both 
may be positive definite. 


is positive definite if and only if det (B — A”BA) > 0 and 
we > BF — 3([ a. |? + [2 |”). (35) 
As is easily verified, the inequality det (B ~ A7”BA) > 0 is equivalent to 
—4y” + 4[1 + bd? — (1 — o)a® — ([a)? + [role 
+ {—(|Ai|? + [Ao|?)? + 2[1 + 6? — (1 — o)a*]([a]? + [Az]’) 
— alo + (2 — o)b]? — 4071 — a%(1 — oJ} > 0. (36) 


We view the left-hand side of the inequality (86) as a quadratic function 
in the variable » whose coefficients depend upon the values of the 
parameters a, a, b. Clearly, for any choice of these parameter values, 
(36) will not be satisfied for all large values of | 1 | . Thus, a necessary 
and sufficient condition for the existence of real values of » satisfying (36) 
is that the quadratic function on the left-hand side of (86) have distinct 
real zeros u, < uz. Moreover, if such is the case, (36) will be satisfied 
if and only if uw. < wu < pe. The zeros p,, we are given by (83) and, 
therefore, they are real and distinct if and only if 


[1 —b? — (1 —a)a’| > |al-|o + 2 — od. (37) 


We now prove that for any given value of c, the values of the param- 
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eters a, b specified by points lying within the open triangular region 
of Fig. 2, and which satisfy (87), are those (and only those) values 
of a, b specified by points lying within the open crosshatched region 
of Fig. 5. 


We begin by first showing that there exist no such values of a, b for 
which 1 — 6? — (1 — a)a’ < 0. Let us assume that this inequality holds 
for some value of c. Then, since 1 — 6” > 0, it follows that ¢ < 1. 
Now, either 


—o/(2—0) Sb <1, (38) 
or else 
—1<b < —o/(2 — 0). (39) 
If (88) holds, then (37) is equivalent to 
—-1+0?+ (1 —a)|al|? > |a| [eo + (2 — od], 


or 
(6-—|a|—1[b- A —o)lal+1]>0. (40) 
If, however, (39) holds, then (37) is equivalent to 
(b+ la|—Db+d—o)l|al+1)>0. (41) 


By considering first only nonnegative values of a, and then considering 
only nonpositive values of a, it is easy to use Fig. 5 and, by inspection, 
determine that there exist no values of the parameters a, b specified 
by points lying within the triangular region, such that both (38) and (40) 
hold. Similarly, it is easy to verify that the same is true regarding 
inequalities (39) and (41). 

We now assume that the parameters oc, a, b are to be chosen such 
that 1 — b” — (1 — o)a” = O. Then there are three cases to consider: 

If c = 1, it follows that ¢ + (2 — o)b > 0 and hence (87) is easily 
shown to be equivalent to 


6+ lal|/—Dfb-—|1—cl-Jal +1 <0. (42) 
If « < 1 and (88) holds, then it follows that (37) is equivalent to 
(6+ |a|—Dib+]1—c¢|-|al| +1] <0. (43) 
If ¢ < 1 and (89) holds, then it follows that (37) is equivalent to 
(Dees f= DB |r| |) a Le: (44) 
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By first considering only nonnegative values of a, and then considering 
only nonpositive values of a, it is easy to use Fig. 5 and, by inspection, 
determine that the inequality (42) is satisfied if and only if the values 
of the parameters a, b are determined by points lying within the open 
crosshatched region of Fig. 5. Similarly, the inequalities (38) and (48), 
or the inequalities (89) and (44), hold if and only if the values of the 
parameters a, b are determined by points lying within the open cross- 
hatched region of Fig. 5. 

It can easily be shown that for any given value of o, and any values 
of the parameters a, b specified by points lying within the open cross- 
hatched region of Fig. 5, it follows from up > yp, that the inequalities (34) 
and (35) also hold. We omit the details of the algebra. 0 


Proof of Theorem 1: Let the Liapunov function V be defined for all 
z2¢G = E’ by (31) and (32) with the values of a, a, b, 4 assumed to be 
such that both of the matrices B and B — A’BA are positive definite. 
It is clear that the equations 


2"(B — A*BA)z = ¢, c>0 (45) 


define a family of concentric ellipses, centered at the origin @ in the 
a-y plane [where z = (2, y)”]. The origin also lies between the two 
parallel straight lines bx + ay = -(1 — «), each of which is tangent 
to exactly one (in fact, the same one) of the ellipses (45). Thus, there 
is a unique value of c* > 0 such that 


c* = min {z7(B — A*’BA)z: bx + ay = &(1 — 0}. 
Let the function W be defined for all ze G = E” by 
W(z) = min {z7(B — A’BA)az, c*}. 


Thus, W(z) is defined by the positive definite quadratic form 
z"(B — A’BA)z for all points lying within the ellipse z27(B — A” BA)z = 
c*, and W(z) is defined by W(z) = c* for all other points in the z-y plane. 

It is clear that for each value of »“*” for which (8) holds, the points of 
the z-y plane determined by br + ay + v“*” = 1 lie on the opposite 
side of the line br + ay = 1 — ¢ from the ellipse z7(B — A’ BA)z = c*. 
The situation is similar regarding the points of the x-y plane determined 
by bx + ay + v“*” < ~1 and the line br + ay = —(1 — e). See Fig. 6. 
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It is an elementary result of analytic geometry that a general second- 
degree equation of the form ax” + bay + cy? + dx +ey+f=0 
represents an ellipse if and only if b” — 4ac < 0. It follows that, if we 
consider the constant-AV loci in the be + ay + v“*” > 1 region, 
and in the bx + ay + »“*” < —1 region of the x-y plane, a necessary 
and sufficient condition for these loci to be arcs of concentric ellipses is: 


4u” — 4[1 — ([ Ar |? + | de Pe + [o?a? — 2(| a |? + | Ae |?) 
+ (a P+ P)71< 0. (48) 


Furthermore, since AV (k, z) is continuous in z, and since the values of 
AV (k, z) along the lines bz + ay + v“*” = +1 are given by AV(k, z) = 
—z"(B — A*BA)z, with B — A’BA a positive definite matrix, it is 
clear that when (48) is satisfied, the constant-AV curves specified 
by (46) (temporarily extending the domain of definition of that function 
to the entire z-y plane) are of the type shown in either Fig. 7a or Fig. 7b; 
that is, the line bx + ay + »“*” = 1 intersects only certain constant-AV 
curves—in particular, only certain such curves for which the value of 
AV is negative. Thus, the center of the ellipses is situated to one side or 
the other of the line br + ay + v“*” = 1 in such a manner that the 
constant-AV ellipses for which AV is positive are not intersected by 
the line. Considering, however, that when AV(k, z) of (46) is evaluated 
at z = 6 its value is positive, it is clear that Fig. 7b is impossible. Thus 
[applying exactly the same reasoning to the constant-AV curves defined 
by (47)], it follows that whenever the inequality (48) is satisfied, the 
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Fig. 7—Possible shape of constant-AV curves defined by equation (46). 
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function AV(k, z) achieves its maximum for bz + ay + v“*” = 1 on 
the line bx + ay + »“*” = 1, and similarly for the behavior of AV(k, z) 
in the br + ay + v“*” < —1 region of the z-y plane. It follows, there- 


< 
fore, that there exists c’ = c* > 0 such that for bx + ay + v“*” = 1, 


AV(k, z) S —c’ S —c*® = —Wz). (49) 
Similarly, there exists c’’ = c* > O such that for bx + ay + v%*” < —1, 
AV(k, 2) S —c’ S —c* = —W(2). (50) 


We have shown that, with the functions V, W defined as specified 
above, the hypotheses of Lemma 1 are satisfied. Thus, the solution 
of (7) satisfies lim,... || 2“ || = 0 provided that the values of o, a, b, u 
are such that B and B — A’BA are positive definite, and provided that 
(48) holds. 

We view the left-hand side of the inequality (48) as a quadratic 
function in the variable » whose coefficient values depend upon the 
values of the parameters a, a, b. Clearly, for any choice of these param- 
eter values (48) will not hold for all large values of | u | . Thus, a neces- 
sary and sufficient condition for the existence of real values of pu satis- 
fying (48) is that the quadratic function on the left-side of (48) has 
distinct real zeros £; < fiz. ; moreover (48) will be satisfied if and only 
if fi <p < fi.. The zeros fi; , fi. are given by 


Ay, = $1 — (il? + [e?) V1 — 07a’). (51) 


They are real and distinct if and only if the inequality (10) holds. 

According to Lemma 2, for any given value of o the matrices B and 
B — A’BA are positive definite for values of a, b that are specified by 
some point lying within the open triangular region of Fig. 2 if and 
only if wi < wu < me, where py, , we are specified by (33). Thus, assuming 
that o, a, b satisfy (10). and (11), there exists a value of » such that B 
and B — A’BA are positive definite and such that (48) holds if and 
only if the open intervals (u;, uw.) and (@,, fz) overlap. That is, if 
and only if uw. < f2 and f, < pw.. Using (33) and (51), these last two 
inequalities are easily shown to be equivalent to (12). O 
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Three-Dimensional Small-Signal Analysis 
of Bipolar Transistors 


By J. L. BLUE 
(Manuscript received November 15, 1971) 


One-dimensional transistors are well-understood today; computer 
techniques for detailed large-signal and small-signal analyses are available. 
Real transistors are three-dimenstonal, however, and lateral effects are 
only understood qualitatively. Accurate modeling of lateral effects cannot 
be accomplished without quantitative analyses of three-dimensional trans- 
istors. Unfortunately, even the simplest analysis of lateral effects leads to a 
partial differential equation. In this paper, a fast and accurate numerical 
technique ts used to solve the partial differential equation. This makes 
feasible a three-dimensional small-signal analysis of transistors operating 
in the low-injection regime. 

Calculated h-parameters for a high-frequency, double-diffused, silicon 
transistor are in good agreement with experimental values. 


I, INTRODUCTION 


One-dimensional transistors are well-understood today; computer 
techniques for detailed large-signal and small-signal analyses are 
available." A new charge-control model,’ which is quite promising for 
use in circuit analysis programs, has been developed with the aid of 
insight obtained from the large-scale computer analyses. 

However, real transistors are three-dimensional and lateral effects 
are only understood qualitatively. Accurate modeling of lateral effects 
cannot be accomplished without quantitative analyses of three-dimen- 
sional transistors. Unfortunately, even the simplest analysis of lateral 
effects leads to a partial differential equation [namely (7)]. Except for 
certain very simple transistor geometrical configurations, an analytic 
solution to the partial differential equation is not possible, and up to 
now, numerical solutions have been at best difficult and expensive in 
computer time. 

In this paper, a fast and accurate numerical technique“ is used to 
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solve the partial differential equation. This makes feasible a three- 
dimensional small-signal analysis of transistors operating in the low- 
injection regime. 

It is hoped that with the aid of this technique, quantitative under- 
standing of lateral effects in transistors can be achieved, and an accurate 
but simple model can be developed. 

In Section II, the partial differential equation for the potential in the 
base region of the transistor is derived. Section III briefly reviews the 
solution technique. In Section IV, the calculation of h-parameters from 
the partial differential equation solution is described. Section V presents 
the results from a sample calculation, and compares them with experi- 
mental values. 


II. TRANSISTOR MODEL 


In this section, a discrete bipolar transistor with collector contact 
made to the substrate is analyzed. (A planar bipolar transistor could 
equally well have been chosen.) The small-signal behavior of the 
transistor in one dimension is modeled and a partial differential equation 
in the other two dimensions is derived. 

Figure 1 is a cross section perpendicular to the surface of a typical 
discrete transistor with diffused base and emitter. A partial differential 
equation for ¢, the potential in the base region, will now be derived. 
Let z be the coordinate perpendicular to the surface of the transistor. 
Let ¢(2, y) be the potential in the base, p(x, y) be the charge density, 
and J(x, y) be the current density parallel to the transistor surface; 
¢, p, ¢, and J are average values, and are taken to be independent of z. 

Ohm’s law is 


Ve = —*5, (1) 


where o is the conductivity, and the continuity equation for the charge is 


BASE EMITTER 
CONTACT CONTACT 





VILLA LLL LAL LLL LLL 
COLLECTOR CONTACT 


Fig. 1—Cross section of transistor. 
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VJ+s2 =0. (2) 


The dp/dt term is composed of capacitatively and resistively injected 
charge into the base, and will be considered shortly. Taking the diverg- 
ence of (1) and substituting (2), one obtains 


Op 


dy ae 2 1 
x al (3) 


2 
Ve= ar oy? = 

In order to calculate dp/dt, the net rate of the charge injection into 
the base region, a specific model is necessary. This paper will consider 
only a simple model, since the purpose is to illustrate the technique 
rather than to model exactly some particular transistor. Two separate 
sets of simplifying assumptions will be made, one set in calculating 
1/c dp/dt to set up the partial differential equation, and the second set 
in calculating small-signal parameters from the solution of the partial 
differential equation. The second set will be discussed in Section IV. 

It is assumed that the emitter doping is sufficiently high that the 
potential on the emitter side of the emitter-base junction is constant. 
This constant is taken equal to zero. It is similarly assumed that the 
potential on the collector side of the collector-base junction is constant. 
The base region is divided into two parts—the active base region, directly 
under the emitter diffusion, and the passive base region, the remainder 
of the base. The conductivity of each of the two regions is taken to be 
constant, but the two conductivities are not equal. Finally, low injection 
is assumed so that current and charge injected by the emitter-base 
junction is uniform in the active base region. 

With these assumptions, the rate of charge injection into the base may 
be calculated. Equations (4), (5), and (6) may be interpreted with the 
aid of the equivalent circuit of Fig. 2, which is essentially a hybrid-r 
model of a transistor neglecting base resistance corrections. The charge 
injected into the base at point (x, y) may be obtained by calculating 
the base current for the circuit in Fig. 2; such an equivalent circuit 
obtains at each point of the active base. Only the circuit elements which 
contribute to charge flow in the z-direction are included; the lateral 
charge flow is calculated by solving the partial differential equation (8). 
Thus, it is not necessary to include in the circuit of Fig. 2 the usual 
hybrid-z base resistances. In the equations below, ¢(z, y) = Vsx(2, y) 
is the only quantity that depends on z and y, for (z, y) in the active 
base region. The rate of charge injection into the base has three compo- 
nents. 
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$(X,Y)= Vee Vee 
B O 


CE 





E 


Fig. 2—KEquivalent circuit used to calculate charge injection into the base and 
current injection into the collector, at a point (x, y) of the active base region. 


(c) The dynamic resistance and capacitance of the emitter-base 
junction contributes to dp/dt only in the active base region. 


(22), “7 (c, % oe Cx) , (4) 
where W, is the width (z-direction) of the active base region; 
C; is the dynamic capacitance per unit area of the emitter-base 
junction, and includes the ordinary junction capacitance; and 
G, is the conductance per unit area. 

(22) The collector junction capacitance contributes 


(Zi), = Beane tee 6 


where Cy is the capacitance per unit area of the collector-base 
junction, veg is the potential of the collector side of the collector- 
base junction, and W in the active base region is W, and in the 
passive base region is Wp, the width of the passive base. 

(771) Base-width modulation in the active base region contributes 


s) <hlere de 
(22 ‘ 1 Wa G ae C at (p Ver), (6) 
where the conductance G’ and capacitance C’ are each per unit 
area. Generally C’ and G’ are not important. In addition, base- 
width modulation is the source of G) , a conductance per unit 
area. 


In the passive base region, the only charge injected is through the 
capacitance Cc . 
Collecting the three terms, assuming frequency dependence e‘“', 
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defining sheet resistance Ry = 1/Wac, and Rp = 1/Weep, the following 
partial differential equations for the potential are obtained. 

Active base: 
V’¢ = RalGe + G’ + w(Co + Ce + Cle 


me: R,[G’ + 1w(C’ + Co)lucz ‘i (7) 
Passive base: 


V’e = twRpCo(e — ver). (8) 
For typical parameters, 
Gr > Go, 
Cz >> Co + C’, (9) 
Co >> C": 
then in the active base the equation simplifies to 
V'o = Ral(Ge + twCz)e — (4G! + iwC ever). (10) 


This equation was used for numerical calculations. 
Equations (8) and (10) are of the form 


V*e = 7 (¢ — Vex), (11) 


where y’ is frequency-dependent and takes on different values y2 and y2 
in the active and passive base regions, but otherwise is independent 
of x and y. 

Any other modeling of the z-direction of the transistor is equally 
usable, provided that it leads to equations of the type 


Ve = ag 7 b, (12) 


where a and Db are independent of x and y in each of several regions. 

Figure 3 shows a typical double-base-stripe discrete bipolar transistor 
from the top. The active base region is under the emitter diffusion, 
marked E; this region is represented by eq. (10). Under the two regions 
marked B, ¢ is constant and g = vg, . The remainder of the region 
shown is the passive base region; in this region, eq. (8) holds. Around 
the passive base region is the boundary of the base diffusion. A boundary 
condition is that no current flows through this boundary; the current 
flowing through the boundary is proportional to the normally-directed 
derivative of the potential, so on this boundary 


0 
n-Ve =7 = 0, (13) 
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Fig. 3—Top view of transistor used in calculations. 


where n is the outer-directed unit normal to the boundary. At the edge 
of the emitter diffusion, the dividing line between the active and passive 
base regions, the boundary conditions are that the potential be contin- 
uous and that the total current flow be continuous. These conditions are 


Pa = YP; (14) 


1 1 
a(Z Vea) = a(t Ver) , (15) 


where n is the unit normal outward from the emitter diffusion, ¢, refers 
to the potential just inside the active base, and gp refers to the potential 
just inside the passive base. Equation (14) neglects the edge injection 
of current from the emitter-base junction; such injection could easily 
be included. It leads to a boundary condition, similar to eq. (15), but 
involving both ¢ and d¢/dn. 


IiI. SOLUTION TECHNIQUE 


This section describes an economical method for solving the coupled 
partial differential equations (8) and (10) with boundary conditions (13) 
through (15). The usual way to solve such equations is by finite dif- 
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ferences. A square or rectangular grid is superimposed on the 2, y plane 
and derivatives at the grid vertices are approximated by finite differences. 
A large number of linear algebraic equations are generated and solved 
iteratively. While very powerful, finite difference methods have several 
disadvantages for the problem at hand. 

It may not be easy to get the transistor boundaries (emitter diffusion, 
base contacts, and base diffusion) to fall on grid lines, so that more 
complicated programming is necessary. The areas in which the equations 
must be solved are frequently odd-shaped (see Fig. 4). In such cases, 
the usual iteration schemes are not guaranteed to converge and it may 
be necessary to develop an efficient scheme. 

The method** developed for solving equations like (11) does not 
employ a grid and does not require iteration to solve the set of linear 
algebraic equations. The method will be briefly described for a single 
equation, 


Ve =e, (16) 
holding in some area A with boundary I. The area A may be any shape; 


it may even be multiply connected. For (16), Ref. 4 derives the exact 
expression, 


mo(s’) = P i Ee ee) + @-r/NK ones | ds. (17) 


Here s’ is any point on the boundary (except at a corner point), P 
denotes the Cauchy principle value, Ky and K, are modified Bessel 
functions of the first kind,’ and r is the vector from s’ to s. Points inside 
the region A are not involved; however, if s’ is inside A, (17) can be 





2 


_ Fig. 4—“Exploded”’ view of one-fourth of transistor. Left—passive base region, 
right—active base region (under emitter). Numbers indicate boundary conditions 
applying (see text). 
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used if + is replaced by 27. Thus, if y and d¢/dn are known on the 
boundary, ¢ may be found inside the boundary. 

To find ¢(s) and d¢(s)/dn on the boundary, (17) is regarded as an 
integral equation. This and the boundary conditions for the original 
partial differential equation suffice to determine the solution. In prin- 
ciple, an accurate numerical solution is quite simple, although the 
computer programming is rather complex. Various methods may be 
used to solve (17) plus the boundary conditions; the method used in 
Ref. 4 will be described here. The boundary I is divided into a number 
of straight-line segments; on each segment ¢(s) and d¢(s)/dn are approx- 
imated by polynomials of any desired order with unknown coefficients. 
Using geometrical information about I’, and the expansion of K, and K, 
in terms of logarithms and polynomials,” the integrals in (17) may be 
done symbolically.* Equation (17) cannot in general be satisfied exactly 
at each point s’ along the boundary; N points are chosen at which to 
satisfy (17) exactly, where N is the total number of unknown coefficients 
in all of the polynomials approximating ¢ and d¢/dn, on every part of I. 
Then (17) is replaced by N linear algebraic equations for the poly- 
nomial coefficients. For typical transistors, N is no more than 50 or so, 
and therefore the equations may be solved directly without iteration 
by Gaussian elimination; N = 50 gives an accuracy of about one percent 
for the numerical example considered later. 

When the polynomial coefficients have been found, ¢ and d¢/dn are 
known on the boundary and ¢ may be found inside A if it is desired. 
For the transistor analysis, however, ¢ is not needed inside A; it is only 
needed on the boundary. Although the program is quite complex, the 
computer time is quite short. For Laplace’s equation, the boundary 
integral equation method is considerably faster than finite difference 
methods.* 

The two coupled partial differential equations, (8) and (10), may be 
solved in a similar manner. Fig. 4 shows the areas of Fig. 3 in which 
each of the two equations can be used. Because of the fourfold symmetry 
of Fig. 3, only one-fourth of the transistor need be considered, and only 
one-fourth is shown. On the left is the passive base region, in which (10) 
holds; on the right is the active base region, in which (8) holds. The 
numbers on the line segments indicate the boundary conditions on each 
segment. On segments numbered 1, the base contacts, (g — vez) is 
constant; on the exterior of the base diffusion, segments numbered 2, 
and symmetry lines, d¢/dén = 0; and on segments numbered 3, the 
boundary of the passive and active base regions, (14) and (15) hold. 
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IV. CALCULATION OF h-PARAMETERS 


With a fast and accurate method of solving the coupled partial 
differential equations (8) and (10), finding the small-signal parameters 
of the transistor is relatively easy. This section describes the method 
for determining common-emitter h-parameters. 

Common-emitter h-parameters are defined by 


Var = Ayte + Ayder = hietg +h, ver, (18a) 
te = hate + hoder = hyetz + hoder. (18b) 


In order to find all four h-parameters at a given frequency, two 
potential solutions are necessary. The first solution gives hj, and ho, 
and assumes vez to be zero. With veg = 0, vey = Aiiip and ig = Aoiiz . 
The base contacts are taken to be at unit potential, v3, = 1. The bound- 
ary integral equation method is used to solve (8) and (10) with boundary 
conditions (13), (14), (15), plus ¢ = 1 on the base contact (lines num- 
bered 1 in Fig. 4). In Fig. 4, the output is d¢/dn on lines numbered 1, 
where ¢ = 1; ¢ on lines numbered 2, where d¢/dn = 0; and both ¢ and 
dy¢/dn on lines numbered 3. The current flowing into the base contact 
is just 


- if d¢(s) 
tz = R, a ds, (19) 


where the integration is over lines numbered 1 in Fig. 4. Since d¢/dn 
is approximated by a polynomial on each of the sides, the integral 
in (19) is trivial. This gives hy, = 1/2, . 

To find h., , the collector current must be found, which involves the 
modeling of the transistor in the z-direction. Again, the simple model 
used may be interpreted with the aid of Fig. 2. Transit-time effects in 
base and collector will be neglected. The z-directed collector current 
flowing through a small area Ax Ay has three components: 

(z) Diffusion of injected charge across the active base region. 
G, is a conductance per unit area. 


(Atc)1 = Gup Ax Ay. (20) 
(22) Injected current from collector-base junction capacitance. 


(Mic)e = —Co = (e — von) At Ay. (21) 


(211) Injected current from base-width modulation in the active base 
region. 


898 THE BELL SYSTEM TECHNICAL JOURNAL, APRIL 1972 


: re) 
(Atc)s = Govcr Ax Ay 7 (@" + cr Ne oe Vor) Ax Ay. (22) 
From the passive base region, current flows to the collector only 
through the capacitance Cz . 
The total collector current is 


te > Ty ze Base Gn 7 oe 7 ta(Ce + ole 
+ [(Go + G’) + tw(Co + CYoce} dx dy 


- ff iwo(e — ver) dx dy. (23) 
Passive Base 


In the calculations, G’ was neglected, Go + G’ = G, were combined, 
and C’ was neglected. 

Thus it is necessary to do an area integral of the potential. Since the 
factors multiplying ¢ are independent of x and y, a two-dimensional 
numerical integration may be avoided. Consider one of Green’s bound- 
ary-value formulas, 


2 2 2 ag _ aa) 
[[ eve -2V'@ away = § (6% os *\ds, (24) 


where the line integral is over I, the boundary of the area A. This is 
true for any well-behaved functions 9(z, y) and G(z, y). Suppose that 
y obeys (16). In the area integral, substitute y’o for Vo to obtain 


2 2 23 dg 20) 
-f[ «ve-vo away =4 (¢%-—% ds. (25) 


Now G may be chosen so that 


VG —-VGE =1, (26) 
or 
G = —1/7’, 
to give 
__1¢ % 
I g dx dy = =P on ds. (27) 


Thus the area integral in (23) may be replaced by a line integral; 
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since 0¢/dn is a known polynomial, the integral is again trivial. In this 
way ic may be found, and ha, = tc¢/iz . 

In order to find hi, and hy: , 7g is assumed to be zero. With ig = 0, 
Vee = hycr and tg = hodcz . The potential at the base contacts, vzgz , 
is unknown; since 7; = 0, d¢/dn = 0 at the base. For the second solution, 
the boundary integral equation method is used to solve (8) and (10) for 
(y — Vex) with boundary conditions (13), (14), (15), plus ¢ = vg, = 
unknown constant and d¢/dn = 0 on the base contact (lines numbered 1 
in Fig. 4). vcz is assumed to be 1. The output is ¢ on lines numbered 1 
and 2, where d¢/dn = 0, and both ¢ and d¢/dn on lines numbered 3. 
Then vgz = hy. is found directly from ¢ on the base contact, and 7¢ may 
be found as before to give hz. = tc. 


V. RESULTS FROM SAMPLE CALCULATION 


Experimental h-parameters were available for a double-diffused, 
double-base-stripe silicon transistor, similar to that shown in Fig. 3. 
The data were taken with vey = 1 V, and J; = 10 mA (approximately 
230 A/cm’). In the theoretical calculations, there are seven parameters: 
Re,Cco,Ra,Gn,Gz,Cz,and G,. Two of these, Rp and G,, , were not 
adjusted. Rp , the sheet resistance of the passive base, was given its 
nominal fabrication value. G,, , which is essentially a transconductance 
per unit area, was set equal to Iz/(A,kT/q). Ax is the emitter area. 
The other five parameters were chosen so that the experimental and 
calculated h-parameter matched at low frequencies (below 1 MHz). 
No additional adjustments were made to fit the high-frequency data. 
The parameter values used in the calculation are given in Table I. For 
the calculations, the nominal fabrication geometry was used. 

Complex-plane plots of the four common-emitter h-parameters are 
given in Figs. 5 through 8. The solid lines are drawn from the experi- 
mental data, only a few points of which are shown (solid circles). Fre- 


TABLE I—PARAMETERS USED IN Firrine h-PARAMETERS 





Rp, sheet. resistance in passive base, 170 ohms/[] 

Ry, sheet resistance in active base, 6000 ohms/[ ] 

Gz, conductance per unit area of emitter-base junction, 0.00051 mho/mil? = 
79 mho/cm? 

Cz, capacitance per unit area of emitter-base junction, 21 pF/mil? = 330 uF /cm? 

Gi, low-frequency limit of ho., 0.0003 mho 

Co, capacitance per unit area of collector-base junction, 0.226 pF/mil? = 0.0350 

em: 
Gn, ‘“transconductance’’ per unit area, 0.058 mho/mil? = 8900 mho/cm? 
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100 200 300 400 


-200 





Fig. 5—Input impedance, hi: or hie, in ohms. The solid line and circles are experi- 
mental, and the open circles theoretical. Frequency is a parameter along the curve, 
and selected frequencies are given in MHz. 


quency is a parameter along each curve; selected frequencies are indicated 
in MHz. At these frequencies, the calculated h-parameters are plotted 
as open circles. 

At the higher frequencies, the discrepancy between experimental and 
calculated results becomes significant. There are several reasons why 
this is to be expected. In the first place, the experimental data may be 
in error, since h-parameter measurements are considerably more difficult 
to obtain at high than at low frequencies. However, it is not necessary 
to invoke this explanation (the traditional one for theorists) for the 
discrepancies to be understood. 

There are two other sources of discrepancy. The first is that the simple 
model used for the transistor is not adequate at high frequencies. 
“Second-order” effects such as transit-time delay in the base and 
collector were ignored; other effects which are small at low frequencies 
could also be important. At high frequencies, the exact shape and size 


ry 20 40 60 80 100 120 
100 
80.3 
30 
~20 
1 
a5 10 MHZ 
-60 3 


Fig. 6—Forward current gain, ho or hse, dimensionless. 
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Fig. 7—Reverse voltage transfer factor, hy. or h,., dimensionless. 


of the transistor is important and the measured transistor may certainly 
differ somewhat in geometry from the nominal. The effect of geometry 
may be seen in the plot for hos . The loop in hz. at high frequencies is 
due to the geometry of the transistor. Calculations done with other 
geometries show that, as the base and emitter stripes are made longer, 
the horizontal width of the loop shrinks; the loop is absent for a transis- 
tor with infinitely long stripes. 

Secondly, at high frequencies the parasitic effects of the header and 
the encapsulation become important and need to be accurately modeled 
if the experimental data are to be matched. Sample calculations with 





ie) 0.002 0.004 0.006 0.008 0.010 


Fig. 8—Output conductance, he: or hoe, in mhos. 
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representative values showed that header parasitics had at least as large 
an effect on the h-parameters as the high-frequency discrepancy. 
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Stability of Distributed Systems With 
Feedback via Michailov’s Criterion 


By G. C. REIS* 
(Manuscript received November 11, 1971) 


This paper is based on results dered during a stability study of the 
Saturn V rocket for which it was necessary to validate the use of Nyquist’s 
encirclement-counting technique in distributed systems. An outline of the 
paper 1s as follows: Certain results concerning the finiteness of the number 
of zeros of polynomials in s and e* are shown in Theorem 1 and its cor- 
ollaries. Theorem 2 1s a generalization of Michatlov’s Criterion. Simplifying 
assumptions, usually valid in practice, yield a simplified test to determine 
af “encirclement-counting”’ is a valid stability test [equation (20)]. The 
results are reformulated for an open-loop analysis. Various aspects of the 
theory are shown by three examples based on an electrical equivalent of a 
simple single-engine, liquid-fuel rocket. 


I. INTRODUCTION 


Liquid-fueled rockets can exhibit a peculiar type of instability due to 
self-sustained longitudinal oscillations. Since the rocket then stretches 
and shrinks longitudinally, it behaves like a pogo-stick, which has 
resulted in the nickname POGO for this type of instability. 

To see how this phenomenon arises, consider the simple diagram 
shown in Fig. 1. The chain of events which can cause POGO is initiated 
by a random variation in thrust of the engine. This thrust variation 
causes the rocket structure to oscillate in its natural modes. The pressure 
in the fuel tank thus varies. This pressure variation is propagated down 
the fuel feed line, resulting in a variation of fuel flow into the engine. 
Since the thrust of the engine is proportional to the rate of fuel entering, 
the loop is completed, and instability results if this resulting thrust 
variation aids the original random variation which initiated the chain 
of events. 


* This work was done when the author was with Bellcomm, Inc., and therefore 
was performed under NASA contract NSW417. 
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———ROCKET 


—-——FUEL TANK 


——-— FUEL FEED LINE 


———— ENGINE 


Fig. 1—Simple model of a liquid-fuel rocket. 


In modeling this physical situation, it is customary to perform a modal | 
analysis of the rocket structure, and retain only the most significant 
modes. This results in the distributed structure being replaced by a 
lumped approximation. Although the same technique could be used 
to lump the feed-line, there are many reasons for desiring to keep this 
element as a distributed parameter. There is no great difficulty in doing 
so, since the fluid equations which govern the feed-line are of the same 
form as electrical transmission lines. 

An equivalent circuit of Fig. 1 would then be Fig. 2, where V,(s), V2(s) 
are the Laplace transforms of the pressure variations at the top and 
bottom of the feed-line, respectively, and J,(s), I.(s) represent the 
transform of flow variations. £,(s) is then the random pressure variation 
at the top of the line due to the assumed random thrust variation above. 
Y(s) represents the hydro-mechanical impedance of the feed-line output, 
and G(s) includes the structural feedback. To see the type of equations 
which will be of concern for a stability analysis, assume that losses can 
be neglected in the feed-line. 





Fig. 2—Electrical equivalent of Fig. 1. 
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The system equations can be written in matrix form as: 


[ cosh ks —Z,sinhks —1 a Vi(s) | ro | 
; oe 
-F, sinh ks cosh ks 0 —1 |) 1,(s) _ 0 | (1) 
L 0 0 Y(s) —1iji2.(s)! ££ O J 














where Z, (a positive real number) is the characteristic impedance of the 
line and k (a positive real number) is the ratio of line length to wave 
speed in the line. The determinant (A(s)) of the matrix appearing in (1) 
is of interest in a stability analysis. It can easily be computed to be 


A(s) = cosh ks — G(s) + Z,Y(s) sinh ks. (2) 


In general, neither Y(s) nor G(s) need be rational (especially in the case 
of multiple engines) but for simplicity, suppose that 


Gs) = ae Z.¥(8) = Dia (3) 


where Ng, De, Ny, Dy are polynomials in the complex variable s. 
Then (2) can be written as 
A(s)[De(s)Dr(s)] = [De(s)Dy(s)] cosh ks — Ne(s)Dy(s) 
+ Dg(s)Ny(s) sinh ks. (4) 


In the following sections of this paper we will be concerned with 
polynomials in the two complex variable s and e*. We now show that (4) 
can be considered as such. Thus multiply (4) by 2e~*"* to give 


F(s) = Qe"* A(s)[De(s) Dy(s)] = De(s)[Dr(s) + Ny(s)] 
— 2N¢(s) Dy(sje™ + De(s)[Dr(s) — Ny(sler" (5) 


or 


F(s) = 2d Rise (6) 


where each F,(s) is a polynomial in s. 

This paper will also be concerned with open-loop and closed-loop 
expressions, and will assume that the quantities of interest will be of 
the form of (6). To show that this is true in our present example, assume 
that G(s) “closes the loop” and solve for the “line transfer function” 
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V.(s)/V.(s). The result is 


V2(s) pot | 7 (7) 
Vi(s) cosh ks + Z,Y(s) sinh ks 





Hence the “loop gain,”’ G(s), is simply G(s) times (7), or 
G(s) 
cosh ks + Z¢ Y(s) sinh ks 


ig Dy(9)N a(s) : (8) 
~ Dy(s) De(s) cosh ks + De(s)N y(s) sinh ks 


Go(s) = 


The denominator of (8) is seen to be of the form of A(s), and hence can 
be made to look like (6). 

Finally, we wish to remove the simplifying assumption that Y(s) 
and G(s) are rational. This is desired since an actual engine is not only 
fed liquid fuel, but also liquid oxidizer. Thus even a single-engine rocket 
has two feed-lines. (The Saturn V has five engines, for a total of ten 
feed-lines.) The structure of a model for rockets of this complexity 
would be that of Fig. 2, repeated once for each feed-line, with suitable 
interconnections through lumped (i.e., rational) transfer functions. 
Thus the G(s) and Y(s) of Fig. 2 will be the ratio of sums of powers of s 
and e*. As such they can be combined to yield forms such as (6). 

A final constraint on the stability analysis is that it is required that 
the analysis be of the conventional open-loop type using Nyquist’s 
Criterion. For lumped systems this presents little difficulty since one can 
always make open-loop measurements at as high a frequency as neces- 
sary to guarantee that all singularities of the transfer function are 
included. Furthermore, for rational functions, no difficulty is encountered 
in closing the contour in the right-half s-plane. With distributed systems, 
however, it is possible that the gain becomes periodic for large magnitude 
of s, and care must be exercised in determining closed-loop stability 
via open-loop gain plots. What is desired, therefore, is a set of conditions 
under which the conventional ‘encirclement counting” technique for 
lumped systems, remains valid for distributed systems of the type 
described. 

There are two important techniques for determining whether a 
polynomial in the two complex variables s and e*’ has any zeros for 
Re [s] = 0. These are the Pontryagin Criterion’ and the Michailov 
Criterion.” It is of interest to see if these criteria can be applied to the 
ratio of such polynomials in order to determine stability of closed-loop 
gain. In a previous note’ it was shown that this is not feasible for criteria 
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of the Pontryagin type. In this paper we are able to develop a stability 
criterion of the Nyquist type from Michailov’s Criterion for a large 
class of distributed-parameter systems, in particular, for a large class of 
transmission line systems with feedback. 

An outline of the paper is as follows: Certain results concerning the 
finiteness of the number of zeros of polynomials in s and e’ are stated 
as Theorem 1 and its corollaries. Theorem 2 is the desired generalization 
of Michailov’s Criterion. Simplifying assumptions, usually valid in 
practice, yield a simplified test to determine if ‘“encirclement-counting”’ 
is a valid stability test [equation (20)]. The results are reformulated 
for an open-loop analysis. Three examples show various aspects of the 
theoretical analysis. The Appendix includes a statement of Pontryagin’s 
Criterion suitable for use in the present paper. Also included are the 
proofs of the two theorems and a derivation of some conditions under 
which Michailov’s Criterion can be simplified. 


II. MICHAILOV’S CRITERION 


As a starting point we consider an equation of the form 


Gz) = eS se G;,;z'e° '* = 0 (9) 


where @;; are complex and w; are real. If any of the w,; were negative, 
we could multiply G(z) by e'**'*, where w, is the most negative of the ws. 
This would not change the zeros of G(z), so we assume 0 S a, < w, < 
+++ <w,,. Dividing by e””’ and letting @,,_;41,; = a:;; , we transform (9) 


into 
F(z) = >», De a,;z'e (10) 
#=1 7=0 
where 7; = Wm — Wm—i+1 > Ofor? = 2,3,---,mandr, = 0. [To relate 


this to the Pontryagin Criterion, note that if the w,; are rational (which 
can always be assumed in a practical situation) then a suitable scaling 
of the z variable will make G(z) of (9) into a polynomial like H(z) of the 
Pontryagin Criterion. ]* 

Before continuing, it should be noted that a proof of the Michailov 
Criterion for exponential polynomials has been presented in the litera- 
ture.” However, this proof assumed that 


m 


|-ai | > y | ain |- (11) 


i= 


* See Appendix and Ref. 1. 
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As will be seen by the example to be considered later, in transmission 
line systems, (11) is almost never satisfied. Hence, a proof is desired 
which is free of this assumption. However, we do use the assumption 
that a,, ~ 0. That this is no loss of generality can be seen by the follow- 
ing considerations. It is clear that | r,,| 2 |r;|,7 = 2, --- , m. Then 
multiplying (10) by e’”* (which does not change the zeros) puts (10) 
into the Pontryagin form. If a;, is zero, the principal term is missing 
and we are finished with the stability study.* To aid in the subsequent 
development, let us rewrite (10) as 


F@ = > Qe") (12) 
where Q;(e"*) = 50., a;;e""*, or as 
F(z) = 2’F,{z) + pS 2'Q,(e7*) (13) 


where k is an integer between zero and n and 


PQ = De 'Oe). (14) 


In the Appendix we prove the following theorem: 


Theorem 1: If there exists a non-negative integer k S n such that F,(z) 
of (14) has at most a finite number of zeros on Re (z) = O,7 then F(z) of (18) 
has at most a finite number of zeros on Re (z) > 0. 


The following corollaries are of interest: 


Corollary 1: If there exists a non-negative integer k S n such that F,(z) 
of (14) has no zeros on Re (z) = 0, then F(z) of (12) has at most a finite 
number of zeros on Re (z) > 0. 


Corollary 2: If Q,(e"*) of (12) has no zeros on Re (z) = 0 then F(z) 
of (12) has at most a finite number of zeros on Re (z) > 0. 
[Corollary 2 follows since Q,(e*) = F,(z).] 


Let 

F@) = 1+ V@2F.@) (15) 
where 

He) = Yea» GE), (16) 


In the Appendix we derive theorem 2: 
* See Appendix and Ref. 1. 
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Theorem 2: The number of zeros of F(z) with positive real part is 


nv =2-24 @ + $2 @.@) 


+ sang (1+ vy) ~ arg (d+ Wi) 4) 


assuming that F(z) has no purely imaginary zeros and where A_.,(F(z)) 
is the net change in arg F(z) along the imaginary axis from —ty to +iy, 
and Ac(F,(z)) ts the net change in arg (F,.(z)) along contuor C, any contour 
outside the semicircle of radius R of Theorem 1. 


III. MICHAILOV’S CRITERION: SPECIAL CASE 


Theorem 2 is the desired statement of Michailov’s Criterion. To 
obtain tighter results, let us now assume that Q,(e *) has no zeros on 
Re (z) 2 O (.e., Corollary 2). Further, let the r; be rational and the 
a;; be real. By virtue of rational r; , Q,(e°*) is periodic in y, for z = 
x + zy. Let this period be P. By virtue of real a;; , replacing z by its 
conjugate results in Q,(e*) being replaced by its conjugate. Hence we 
need only consider the semi-infinite strip defined by x> O and P 2 y 2 0. 

Michailov’s Criterion can be simplified if it can be shown that Q,(e *) 
does not wind around the origin as z varies over a suitable C. We now 
consider this possibility. 


Q,(e *) 


> a;,e "(cos rjyy — 7sin 7,;y) 
j=l 


™m 


™m 
Qin + >> ane" cosryy — i >. a,,e 7% sin ry 
j=2 j=2 


Re [Q,] — ¢ Im [Q,]. (18) 


If either Re [Q,] or Im [Q,] does not vanish along C,, then Q, cannot wind 
around the origin. In the Appendix we derive sufficient conditions for 
this. 

We now assume that A,[Q,(e *)] = 0 and write the Michailov Criterion 
as 


N = 5-2 A .(P@) + arg (1 + vu) (19) 


ks 
2 


where A_.,,/2 is that part of the imaginary axis from 0 to zy. Thus N = Oif 
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and only if 
nT : 
A_w/2lF@)) = “y+ arg (1 + vy). (20) 


We remark that arg (1 + w(cy)) can be made close to zero for y suf- 
ficiently large. 


IV. MICHAILOV’S CRITERION APPLIED TO OPEN-LOOP ANALYSIS 


The Michailov Criterion, as well as its predecessor the Pontryagin 
Criterion, settle the problem of finding rhp zeros of polynomials in z 
and e’. In many engineering applications, however, this polynomial is 
not directly available, but a related ratio of such polynomials can be 
found. In the study of feedback systems, for example, an open-loop 
gain can be measured and it is desired to find the poles of the closed-loop 
gain. These latter poles are the zeros of the polynomial which results 
from adding the two polynomials whose ratio is the open-loop gain. 
Stability has been determined for nondistributed systems by counting 
encirclements of the open-loop gain along the imaginary axis. What we 
propose to do next is to provide a similar criterion for the distributed 
parameter problem. Thus let F(z) = D(z) + N(z). Then 


Ar(D@) + N@) 


a(o@y(1 + 2) 


Il 


a(D@) + (1 +¥8). 


Let the contour I be composed of a portion of the imaginary axis w 
and another (possibly semicircular) contour C,, such that I encloses 
all zeros of D(z) + N(z). Then 


Ar(D@) + N@) = Ao(D@) + a.(1 + ro) + Ad + ma). (22) 


It is the term 
N Nie) 
1 
a.( T De) 
which is usually available for determining stability. We ask, 


N(z) 


ae re) = Ap(D@) + N@)?”. 


“When does an(1 + 


STABILITY OF DISTRIBUTED SYSTEMS 911 
The answer is that this happens exactly when 
N Nie). 
0 = Ax(D@) + ai + 22 (23) 


To develop a more practical criterion let us rewrite this expression 
using n, to be the highest power of z in F(z), and N; to be the number 
of zeros of F(z) inside [. Then 


= Ar(D@)) + Ac(D@) + N@) — Ac(D@) (24) 
0 = 2nNp + NointT — NpT (25) 


where we have neglected those terms which become small for large z. 
If we limit further consideration to systems which are open-loop stable 
(i.e., Np = 0) then (25) requires that npivy = np. In most practical 
situations, the open-loop gain is bounded at infinity, that is to say 
Ny S&S Np. Hence npiy S np. Since Np.y < Np requires 


Ne) _ 
lim De) D(z) 


we can conclude that counting encirclements of the open-loop gain is a 
valid method for determining stability [i.e., (25) is satisfied] for systems 
which are open-loop stable (Vp = 0) and whose gain is bounded at 
infinity (np.n S np) but does not approach —1 for large frequencies 


(npin { Np). This includes the case, usually found in practice, that the 
open-loop gain approaches zero for large frequencies. 


= —1, 


V. EXAMPLES 


All examples refer to Fig. 2 and are chosen to illustrate various aspects 
of the analysis. At various points in the examples, the following assump- 
tions concerning G(s) and Y(s) are referred to: 

Al. G(s) and Y(s) are each the ratio of two polynomials having real 
coefficients with no singularities on Re (s) > 0. 

A2. lim, G(s) = k, where k, is real and | k,| S 1. 

A3. lim,.. Y(s) = lim,.,. sC where C is non-negative real. 

Assumption Al requires G and Y to be stable transfer functions. 
Assumption A2 insists that the feedback gain at infinity be less than 
unity. Assumption A8 is physically appealing. 

Some unusual properties of the natural frequencies of this system 
have been described elsewhere.*”” 

Example 1: In this example, Assumptions Al, A2, and A3 are invoked, 
and G(s) and Y(s) are given by (3). We wish to show that assumption 
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(11), used in previous derivations of Michailov’s Criterion, is not met 
and that it is valid to count encirclements of the Nyquist plot to de- 
termine stability. By Al, De(s) and Dy(s) have no right-half plane 
zeros. Thus A(s) has right-half plane zeros exactly when A(s)D¢(s)Dy(s) 
has right-half plane zeros. 

From Assumption A2 we conclude that deg Dg(s) 2 deg Ne¢(s). 
From A3 we conclude that deg Ny(s) > deg Dy(s). From this, and (5) 
and (6), we see that the principal term is present and that the assump- 
tion (11) used in previous proofs of Michailov’s Criterion is not met. 
In fact, one can readily convince oneself that this will be the case when- 
ever lossless transmission lines are involved, since all exponential terms 
will involve hyperbolic functions. 

Using the notation of (6), the open-loop gain (8) can be expressed as 


R,(s)e** 


= RG) + Rae ™ 


(26) 


whose norm becomes small for large, right-half plane values of s. 
Hence it is valid to count encirclements of the open-loop gain about 
the point +1. 

Example 2: Here we show the necessity of the open-loop stability 
requirement. Suppose Assumptions Al and A2 are invoked and further 
assume that De(s) = 1, Nc(s) = ki, Dy(s) = 1, Ny(s) = sC, + g. 
This corresponds to terminating the line in a capacitance C, and shunt 
conductance g ¥ 0. The open-loop gain for s = jw becomes 


ky 


CO) = aaa = tC) Sik ED I 


(27) 


which is real only when sin kw is zero. This implies that cos kw is +1. 
Thus if | k, | < 1, G.(jw) cannot encircle the +1 point. (This result is 
in agreement with Assertion 2 of Ref. 5, to which this problem corre- 
sponds if g = 0. It is intuitive that adding losses to a lossless system will 
enhance stability.) 

To show the necessity of the open-loop stable requirement, note that 
g can be either positive or negative. From the Pontryagin Criterion, we 
see that (27) is then stable or unstable, respectively, and that for 
|k,| < 1 the closed-loop system is stable or unstable, respectively. 
However, in either case there are no encirclements of the critical point 
by the open-loop gain. 
Example 8: Let 
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a - 
Dd as" 
1=0 

n 
» bjs’ 
7=0 


> es* 
Ze V(s) = = Cp ¥ O. 


Dp 
ds 
1=0 


In this final example we look at how the a, b, c, and d coefficients of (28) 
enter into the R;(s) polynomials of (6) and into the Q,(e *) polynomials 
of Theorem 1 and its corollaries. We show how the Assumptions A2 
and A3 affect whether the system satisfies Corollaries 1 and 2, and 
thereby provide examples of such systems. Using (8) and (28), (5) 
becomes 
F(s) = s"*?{b,(d, + ¢,) — 2a,d,e-"* + b,(d, — c,)e7**} 
ah sere -1 air Cy-1)b, ae b,-1(d, a Cy) 
as 2[a,,-1d> + Anlly—rJe 


a. [b,.(d,-1 = Cy-1) a b,-1(d> = Cle **} 





G(s) = 


(28) 


n+p—2 


+ DY s"{ DO bd: +e) — 2" YF ad; 
+e 5) bd; — ¢:)}. (29) 


First we investigate the zeros of the coefficient of s"*” in (29). [This 
coefficient corresponds to Q,(e-*) in Corollary 2.) For simplicity let 
e’* = z. This maps the left-half s-plane into the unit circle in the z plane. 

If d, = c,, then the coefficient of s**” has zeros whenever b,c, = 
a,,C,z . If a, were zero, the coefficient in question would become constant, 
which satisfies the conditions of Corollary 2. If a, is not zero, then the 
condition under discussion simplifies to 2 * = b,/a,. All solutions of 
this will satisfy | z| < 1if|a,|< |, |. Hence all zeros of the coefficient 
of s"*? in (29) will lie in the left-half plane if |a,| < |6,|. This is 
intuitively appealing since this requires that G(s) have less than unity 
gain at large frequencies (as required by Assumption A2). 

On the other hand, if d, ¥ c, , then the zeros of interest are solutions of 


20 nly a 


@)y — aes (e') + ea 0. (30) 
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It is well-known’ that solutions of (30) (for 2~*) have magnitude less 
than unity if and only if the following three conditions are met. 














dp + Cy 
era a | (31a) 
dy +c, 2d, Qantlp 
: - d, — Cp c d, — Cp 2 b,(d, a 3) xo)) 
2d, 20ndly 
d, — Cp 4 ~~ ba(dp a Cy) te) 


Thus conditions (31) are NAS for | z | > 1. Condition (31a) requires that 
d, and c, have opposite sign. Since this corresponds to terminating the 
line in a negative conductance at high frequencies [i.e., lim,.,,Y(s) < 0], 
we reject this case. If both d, and c, are nonzero, and have the same 
sign, (81a) is violated. If d, # 0, (81b) and (8lc) together require 
|a, | < |b, | as before. The remaining possibility is that d, = 0. This 
is a reasonable physical assumption; in fact, it is required by Assump- 
tion A3. Invoking Assumptions A2 and A3, the coefficient in question 
now becomes b,c,(1 — e~™*) which has an infinity of purely imaginary 
zeros, and this example no longer satisfies Corollary 2. 
To see if it satisfies Corollary 1, rewrite (29) as 


F(s) = 8"*?""{(sDpCp + Da—1Cp + DaCp-1)(1 — e?"*) — 2a,d,_1e 
+ Bady-1(1 + e7™"*)} 


n+p—2 
ae a s”{ Ds Lb\(d; + ¢:) — 2e“a,d; + eb,(d; — e,)]}. (82) 
m=0 fj+it=m 
We complete this example by finding conditions under which the coeffi- 


cient of s"*?* in (32) satisfies the conditions of Corollary 1. This coeffi- 
cient can be written as 


he as Ie hice Aha baa ade 


— (is ws + b,c, + 0,¢,-1 — bxd,-1): (33) 


Let w = ks. Then (83) becomes 


dyCp E 4g bat Mop bts | 


k b, Cp Cy 
-w bey kb,-1 kep—1 = Kaas -2Qw 
2a,d,_1€ a | + “He. + a ee oo (34) 
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Zeros of (34) are given by the solutions of (35) 





(w+ a) +7" — (wt ple” = 0 (35) 
where 
Hc p( b= ao Cp-1 3 tt) 
n Cy Cp 
a: n-1 Cp-1 te) 
oe a2 a Cp Cp 
— of 2 Fem 
y = —2k ie 


We assume that a > 6 since a — B = 2d,-_,/c,k which is positive by 
Assumption A3. 

It is also reasonable to assume a > 0, since b,_,/b, must exceed zero 
for the denominator of G(s) to be strictly Hurwitz,* and since c,_,/c, 
less than zero would imply zeros of Y(s) in the right-half plane. These 
considerations also imply that |a| > |6]| . Using these assumptions 
(i.e., a > B,a >0,|a| > | B]) it follows that 


Jw+al>|w+s6| 
for Re (w) = 0. Evaluating the magnitude of (35) on Re (w) 2 0 yields 


|w +a + ye" — (w+ Ble” | 
2 laren l= lilo" = lee le 
>|wte|—|rvy|—|w+s]. 
If lim,.. G(s) = 0, then a, = 0 and y = 0. This means that (35), and 


hence (83) and (84), have no zeros in the right-half plane and thus 
Corollary 1 is applicable to this problem. 


VI. CONCLUSIONS 


It has been shown that the time-honored technique of determining 
the existence of unstable poles of a closed-loop gain by counting encircle- 
ments of the critical point of the open-loop gain along a finite segment 
of the imaginary axis remains valid for a large class of distributed 
parameter systems of practical importance for which the open-loop gain 
approaches zero for large frequencies. Existing limitations of the Michai- 


* A well-known necessary condition for a polynomial to be Hurwitz is that all 
coefficients have the same sign (see, for example, Ref. 6, p. 281). 
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lov Criterion have been removed so as to include physical systems of 
lossless transmission lines. 
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APPENDIX 


A.1 Pontryagin’s Criterion 


Let h(z, t) be a polynomial with complex coefficients in the two 
complex variables z and ¢. Pontryagin’ has developed necessary and 
sufficient conditions that the function H(z) = h(z, e’) have zeros with 
only negative real parts. We now present one of Pontryagin’s main 
results. Let r and s be the degrees of the polynomial h(z, t) with respect 
to z and ¢. Then the principal term of A(z, f) is the term containing the 
product zt°. Pontryagin showed that if h(z, t) does not contain the 
principal term, then H(z) has an infinity of zeros with arbitrarily large 
positive real parts. 

Let p(-) and q(-) be real-valued functions of a real variable. We say 
that the zeros of these two functions alternate if: (¢) they have no 
common zeros, (77) they have only simple zeros, and (277) between every 
two zeros of one of these functions there exists at least one zero of the 
other. The result of Ref. 1 which will be used in the present, study is: 

Let A(z, t) be a polynomial with the principal term and H(iy) = 
F(y) + iG(y) where F(y) and G(y) take on real values whenever y is 
real. If all zeros of the function H(z) have negative real parts, then all 
zeros of F(y) and G(y) are real, alternate, and 


G'y)FYy) — F'y)Gy) > 0 


where superscript prime denotes the derivative. In order that all zeros 
of H(z) have negative real parts, it is sufficient that all zeros of F(y) 
and G(y) are real and alternate and that G’(y)F(y) — F’(y)G(y) be 
positive for some y. 


A.2 Proof of Theorem 1 


Choose a real number R’ > 1 such that Re (z) > 0 and F,(z) = 0 
implies | 2 | < R’. Define the set 6, 


6= {z|Re(z) >0 and |z|> R’}. 
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Let 
D, = inf | F,(z) | > 0 
zee 


and 


mm 


M,= sup )>>[a;; |. 
f=0,00+,k-1 i= 
If MM, = 0, then F(z) = F,(z) and the theorem is trivially true. If 
M, > 0, then for all z e 6 the following inequalities hold. 
k-1 


| F@ | = |eF@) | — 2 2’Qie*) | 


i=0 


k-1 m 
Bet) De- Sle | Dlaw | per | 


k-1 m 
= D,|2 | - dle | Le [a | 
i= i= 


k~-1 


=D, [2//-— M.D Ie | 
k 
> D,|2|'- mu, zt=1 
eae 1 
» 2 De|- ) -M)+™, 
= bei 
= Elie) - wo. + uo + 2} 


which is positive for |z| > 1+ M,/D, = R, . Hence the magnitude of 
all zeros.of F(z) must be bounded by R, the larger of the numbers FR’ 
and R,. The theorem follows at once by noting that F(z) is analytic 
and that an analytic function has at most a finite number of zeros in 
any finite region. 


A.3 Proof of Theorem 2 


We now wish to derive an expression for the number of rhp zeros 
of F(z). We choose a contour I varying along the imaginary axis from 
—y to y (call this portion w) where y 2 R, of Theorem 1 and close it 
by a contour C’ outside the semicircle of radius R of Theorem 1. Let 
F(z) and ¥(z) be as defined in (15) and (16), respectively. We choose 
contour C (and increase y, if necessary) so that | ¥(z) | < 1 along C. 
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Let N be the number of zeros of F(z) inside I and let 
Ar (F(@)) 
be the net change in arg (F(z)) along I. Then N, the number of zeros 


of F(z) enclosed by T — counterclockwise travel), is given by 


N= 5 An(F (2) = 5 A.(F@) +5 ~ Ac(F(@)) 


a : A.(F@)) + Act) + A.Fx@)) + Ac(l + ¥@)) 


since F(z) has no zeros on I. Since 1 + y(z) does not wind around the 
origin (its real part being always positive), Ac(1 + w(z)) = 0. Since 
Ac(z") = k/2 we have thus proven Theorem 2. 
A.4 Sufficient conditions that Q, does not wind around the origin 

Let 


Tt, = min Ups 


T=2Z,cee,m 


Then 


m 
Ain —(r;— 
ke + >» in 5 (rj TH agg ry 


j=2 Din 














ae | | Re [Q,] | ae -> Is 


me 


Hence | Re [Q,] | > 0 for « > 1/r, In a where 


Thus we need only consider a eae defined by 0 < x S Ina/7,, 
0 < y < P. [Note that any contour C will work if a < 1 which is the 
case if assumption (11) is used.] Q,(e~*) does not wind around the origin 
for any contour with z > In a/r;, , since Re [Q,] does not change sign. 
If >0"., a;7** ~ 0 for0 < x < Ina/r; then Re [Q,] does not change sign 
for y any integer multiple of P, and0 < x < Ina/r,. A suitable contour 
could then consist of a horizontal line at y = KP from x = 0 tov = 
In a/r, , where K is an integer large enough so that KP > R of Theorem 1. 
The rest of the contour in the first quadrant could be semicircular. The 
contour is completed in the fourth quadrant by the mirror image of the 
first quadrant. Since Re [Q,] ~ 0 along this contour, Q,[e~*] does not 








a 
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wind around the origin. (Note that a;, > 0,7 = 1, --- , m is sufficient 
to satisfy the conditions of this paragraph.) Furthermore, since Re [Q,] 
is even in y, Ac(Q,(e *)) = 0 along the contour chosen. 

This result can be extended to include the case where Re [Q,] has 
simple zeros on y = KP. In this case, use semicircular indentations 
around such points, in the direction to have Im [Q,] > 0, in both first 
and fourth quadrants. (Hence the contour ceases to be symmetrical 
about the real axis.) Thus, along the deformed horizontal lines, the 
graph of Q,(e *) remains in the upper-half plane. Along the semicircular 
portion it remains in either the right- or left-half plane. Hence no 
encirclements of the origin are possible and again Ac[Q,(e *)] = 0. 
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The Spectral Density of a Coded 
Digital Signal 


By B. S. BOSIK 
(Manuscript received August 23, 1971) 


The stochastic process appearing at the output of a digital encoder is 
investigated. Based upon the statistics of the code being employed, a sys- 
tematic procedure 1s developed by means of which the average power spectral 
density of the process can be determined. The method is readily programmed 
on the digital computer, facilitating the calculation of the spectral densities 
for large numbers of codes. As an example of its use, the procedure is 
applied in the case of a specific multi-alphabet, multi-level code. 


I. INTRODUCTION 


In recent years, increased interest has been focused on more complex 
multi-alphabet, multi-level codes.’~* Such codes are designed to produce 
a digital pulse train with specific spectral properties making it suitable 
for transmission over digital repeatered lines. These properties generally 
include the absence of a de component and a strong spectral component 
from which timing can be extracted. This paper presents a method for 
calculating the spectral composition of the pulse trains resulting from 
the use of these codes. The procedure is applicable to a wide variety 
of codes. 

A code may be defined as a set of mappings from a set of input 
symbols (or words) to a set of codewords. Each mapping is called an 
alphabet. The code may use different alphabets depending upon the 
state of the coded signal.’ It is desirable for unique decipherability 
that the set of mappings be one-to-one, i.e., that no matter to how 
many alphabets a codeword belongs, it always corresponds to the same 
input symbol. However, this restriction will not be imposed here. 

In general, when the code is applied to a sequence of input symbols, 
the resulting encoded signal is a stochastic process, the statistics of 
which depend on the input symbol sequence statistics and the code 
statistics. For convenience, a random input symbol sequence will be 
assumed so that the input symbols are equally likely. Even if the 
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symbols are not equiprobable, the procedure for calculating the spectral 
density outlined in Section III remains valid, although the methods 
for calculating the required signal statistics described in Section IV 
must be modified. The spectral density derivation only becomes in- 
applicable when the statistics of the input symbols vary with time, 
so that the ergodicity and stationarity assumptions of Sections II 
and III are no longer valid. 

The codes to be considered in this paper will have N states, each 
state corresponding to a single alphabet. The alphabet assignment 
need not be unique, i.e., more than one state can correspond to the 
same alphabet. The codes will have a block length DL and the number 
of codeword symbol values (levels) will be M. 


II, THE CODED SIGNAL 


The codeword symbols are, in general, transmitted on some standard 
pulse shape g(t) at intervals of duration 7’. The signal, then, may be 
expressed by 


-) 


x) = So ag(t — nD) (1) 


where a, is the codeword symbol value for the time slot n7? S$ it S 
(n + 1)T. The values which {a,} assume are determined by the code 
and the input symbols which are to be coded. The discrete parameter 
random process formed by the sequence of codeword symbols {a,} 
has an autocorrelation function R(k) = E{a,d,,.}, and, thus, is assumed 
to be wide-sense stationary. The autocorrelation function of the coded 
signal x(t) is, then, 


i] ine] 


RAt+ 7,0 = Dd) DS Rim —njglt+7—nT)g(t— mT) (2) 
which is, in general, a function of both ¢ and 7. The coded signal is not, 
therefore, wide-sense stationary. However, it is easily shown that 
R.(t + 7, €) is periodic in ¢ with period 7. The coded signal is, then, 
a cyclostationary process. 


If. THE POWER SPECTRAL DENSITY OF THE SIGNAL 


Since the coded signal is not a wide-sense stationary random process, 
the Fourier transform relationship between the autocorrelation function 
and the power spectral density cannot be invoked to find its power 
spectrum. However, the average power spectral density of a cyclo- 
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stationary process of the form described above has been derived by 
W. R. Bennett.” Under the assumption that the process is ergodic, 
the spectral density is 


walf) = 51 EH P [20 +2 RW) cos oak | (3) 


where G(f) is the Fourier transform of g(t). The determination of the 
spectral density, then, requires the calculation of the autocorrelation 
function R(k). 

A method of calculating R(k) can be derived as follows. Let the 
probability of being in state 2 during time slot n be 


P(s, = S:) = P(S;); tt AD ee BIN (4) 
and let the probability that the symbol a, assumes the value A, , 1 = 
L 2, , M, and ansx assumes the value A,,,m = 1, 2, , M, given 
i 8S, = S; be 


P(a, = A, ) On+k = An | Sn = S;) = P,(A, ey. oe | S,). (5) 
Then R(k) can be expressed as 


R(k) = > PS, > s AiAmPi(Ar , Am | Si). (6) 


=1 m=1 


But in time slot n, there is a probability of 1/Z of being in the jth 
symbol, a; , of a codeword, j = 1, 2, --- , L. Thus, 


L 
1 : 
P,(Ai , Am | S;) = Dz PilAr , An | 8: 5 5). (7) 
Substituting eq. (7) in eq. (6), we obtain 
1 N L M M 
R(k) — L ds P(S:) dy do du A,A,P,(A, ’ An | S; eps (8) 
Now define 


M 
Dd AAnP(Ar , An | 8:59) 


ee 


Rs.(j, j T k) = 
1=1 m=1 
= Bg {0 ;0; 42}; j= Ly 2yo8* ,b. (9) 
Thus, 
- 1 N L 
RW) = | UPS) VRsG,i +H. (10) 


The equation by which Rg;(j, 7 + k) is calculated is a function of 
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both j and k. The values of 7 and k determine the relative positions of 
the codewords to which the symbols a; and a;,, belong. This knowledge 
combined with a knowledge of the coded signal statistics allows the 
calculation of Rs;(j, 7 + k) as follows: 


(1) k = 0: 
For k = 0, a; and a;,, are in the same position of the same codeword. 
Thus, 


A 
Rei, f) = DS APPA 8.59) (11) 
where 
P(A,|S;, 9) = P(a; = Ai|s; = 8). (12) 
(ii) k = 1,2,---,D: 


For this range of k, a;,, is in the same codeword as a; forj S L — k, 
and is in the next codeword for 7 > L — k. Thus, 
Rs.9,9 + k) 


M 


M 
> > A:AwP A: , An | 8:9); 35 L-k 


l=1 m=1 
= M M N ; (13) 
[Fi De ArAn 27 P(Ar s So | Se 5 3) 
=1 m= n= 
where 
P(A,,8,|8:,7) = Pla; = Ar, Sin = Si] 8; = Si). (14) 

(uw) k=L+1,L42,---, 2D: 

For this range of k, a;,, is in the codeword immediately following 
the codeword containing a; for 7 S 2L — k, and is two codewords away 
for 7 > 2L — k. Thus, 


Rs;(j, j a k) 
M M 


Dy Le AiAn DU P(Ar, 8, | Ss 59) 


t=1 m=1 
MM N N 
Dy de AAn DP P(Ar, S, | Se» 5) De PaCS, | Sx) 


PAe'| Spee = 2D), j>2L—k 
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where 
P,(S; | S,) = P(Sj41 = 8, | fae S,). (16) 
(wv) k = QL +1, QL +2, ---,(Q@+ DL for Q = 2: 


For this range of k, a;,, is in Qth codeword following the codeword 
containing a; for 7 S$ QL — k, and is Q + 1 codewords away for 7 > 
QL — k. Thus, 


Rs, J + &) 


M 


M N N 
Dy 2 Aram DE P(Ar » Sa | Si 5 1) Dy Po-s(S, | Sx) 


t=1 m=1 


a ‘P(A, | S,,j +k — QD), 37S QL—-—k (17) 


M N N 
> AAn di P(Ar, S| 8: 53) 2s PalS, | S,) 


Ms 


U 


Ml 
= 


m=1 


(A, |S,,j7 +k -Q+ 1D), j>QL—k 


where 
PS, | S,) = 2 Po-(S: | S,)Pi(S, | 8), Q = 2 (18) 


and can be calculated recursively. 

In summary, then, the procedure described in eqs. (11) through (18) 
is used to calculate {Rs,(j, 7 + k)} which is substituted in eq. (10) to 
obtain {R(k)}. The spectral density w.(f) can then be obtained from 
{R(k)} by means of eq. (8). 

It is easily seen that for any code other than the most trivial, the 
calculation of the spectral density is a formidable task. However, it 
is a relatively straightforward procedure to program a digital computer 
to perform the above calculations; and this is the most profitable use 
of the procedure. 


IV. THE CODED SIGNAL STATISTICS 


The calculation of the spectral density described above requires the 
knowledge of numerous probabilities concerning the code. These 
statistics are, in general, readily obtainable from the code by merely 
counting the number of occurrences of the phenomenon involved, or 
are determined from simple calculations involving previously obtained 
probabilities. The procedure for obtaining each of the necessary prob- 
abilities follows: 
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(t) The State Transition Probabilities: 

P,(S, | S,) = P(Sj41 = S, | 8; = Sn) pn =1,2,-:- iN. (19) 

This transition probability is obtained by counting the number of 
codewords in the alphabet used when in state S, , whose next state is 
S, . (Given any codeword in any state, the next state is uniquely defined 
at the end of that codeword.) The resulting sum is divided by the 
number of codewords in the alphabet. The probability of a transition 
from state S, to S, in Q steps, Pa(S, | S,), can be calculated recursively 
from P,(S, | S,) via eq. (18). 

(iz) The State Probabilities: 

P(S;) = P(s, = 8); «t= 1,2, ---,N. (20) 

The probability of being in state S; is calculated from {P,(S, | S,)}. 

{P(S;)} are the solutions to the set of simultaneous linear equations 


YS PS)P(S, |S) =P); F=L%-",N. CY 


However, only N — 1 of these equations are linearly independent. 
An additional equation must be used: 


¥ PS) = 1. (02) 


(177) The Symbol Probabilities: 
P(A, | 8; , 9) = Pa; = Ails; = 8); t= 1,2,---,N 
j=1,2,---,L 
L=1,2,-:-,M. (28) 
These probabilities are determined by counting the number of 
occurrences of symbol A, in the jth position of the codewords in the 


alphabet used when in state S; . This sum is divided by the number 
of codewords in the alphabet. 


(wv) The Symbol Combination Probabilities: 
P(A, An | 8; , 3) 
= Pa; = Ai, aj4, = A, | s = 83); 1125 tee NN 
=1,2,---,L 
lLm=1,2,---,M. 
ebay (24) 


ms. 
| 


lI 
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This is the probability of the occurrence of the symbols A; and A, , 
in time slots j and j + &k respectively, within the codewords of the 
alphabet used in state S; (i.e., k < L — j). Thus, for position 7 of the 
alphabet of S; , the number of times a; = A; and a;,, = A,, in the 
same codeword are counted, and divided by the number of codewords 
in the alphabet. 


(v) The Conditional State Transition Probabilities: 
P(A, , 8, | 8: , 9) 
= P(a; = A; , 81 = S,|8; = Si); 7,2 = 1,2,---,N 
j=1,2,°:-,L 
l= 1,2,---,M. (25) 


This is the probability that a codeword in the alphabet of state S; 
has the symbol A, in the jth position and has the state S, as its next 
state. Thus, the number of times that a codeword is in the alphabet 
of state S; , whose next state is S, and whose symbol level is A; in the 
jth position, are counted, and divided by the number of codewords 
in the alphabet. 

These five sets of statistics are all that are required to perform the 
calculation of the spectral density. Although following the procedures 
for obtaining these probabilities is a very straightforward task, it is, 
again, a tedious one, especially for any reasonably complex code. Here. 
again, the digital computer can be used to good advantage. 


Vv. AN EXAMPLE-THE FRANASZEK MS-43 CODE 


The Franaszek MS-43 code’ is a ternary, 4-state, 3-alphabet code 
of word length 3. It is one of a family of codes designed to produce a 
digital pulse train with specific desirable properties. These properties 
include the absence of a de component, a bounded sum of previous 
digits, and a strong spectral component from which the signaling 
frequency can be derived. The code is shown in Table I. Alphabet A, 
is used when in state S, , alphabet R, is used in state S, or S; , and 
alphabet R; is used in state S,. The state is determined at the end of 
a codeword by summing all previous digital symbols. This sum is 
inherently restricted to be 1, 2, 3, or 4 corresponding to states S, , S2, 
S; , and S, . Tables II through VI list the statistics necessary for cal- 
culating the spectral density as determined by the procedures described 
in Section IV. The digital computer was utilized to perform the spectral 
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TABLE IJI—StratTe PROBABILITIES FOR 
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P(Si) 

7 P(S;) 
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density calculation. The resulting normalized spectrum is plotted in 
Fig. 1. The result is consistent with the expected properties of the 
coded signal spectrum, i.e., zero de component and periodicity with 
period 1/7. 


VI. CONCLUSION 


A general procedure for determining the average power spectral 
density of a coded digital signal has been presented. The procedure is 
long, but straightforward and readily programmable on the digital 
computer. With the aid of the computer, the spectral content of signals 
resulting from the implementation of large numbers of codes can be 
obtained. 
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TABLE V—SYMBOL COMBINATION PROBABILITIES 
For MS8S-43 CopE 


Py(Ag, An\Si, j) 
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TasBLE VI—ConplITIONAL STATE TRANSITION PROBABILITIES FOR MS-43 CopE 
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Fig. 1—Power spectral density for the Franaszek MS-43 code. 


of a private communication from A. Fromageot discussing the problem 
of calculating the spectral density of the Franaszek MS-43 code. 


REFERENCES 


1. Franaszek, P. A., ‘“Sequence-State Coding for Digital Transmission,’’ B.S.T.J., 
47, No. a (January 1968), pp. 143-157. 

2. Sipress, J. M., “A New Class of Selected Ternary Pulse Transmission Plans for 
Digital Transmission Lines,’’ IEEE Trans. Commun. Tech, COM-13, (Septem- 
ber 1965), pp. 3866-372. 

3. Neu, W., and Kiindig, A., ‘Project for a Digital Telephone Network,’’ IEEE 
Trans. Commun. Tech, COM-i6, (October 1968), pp. 633-648. 

4. Johannes, V. I., Kaim, A. G., and Walzman, T., “Bipolar Pulse Transmission 
with Zero Extraction,” IEEE Trans. Commun. Tech., COM-17, No. 2 (April 
1969), pp. 303-310. 

5. Bennett, W. R., “Statistics of Regenerative Digital Transmission,’’ B.8.T.J., 
37, No. 6 (November 1958), pp. 1501-1542. 


Copyright © 1972 American Telephone and Telegraph Company 
Tue Bett System TECHNICAL JOURNAL 
Vol. 51, No. 4, April, 1972 
Printed in U.S.A. 


Objective Measures of Peak Clipping 
and Threshold Crossings in 
Continuous Speech 


By PAUL T. BRADY 
(Manuscript received September 1, 1971) 


This study reports data on the statistics of instantaneous speech levels 
in continuous speech samples, with special emphasis on threshold crossings 
and other quantities related to peak clipping. All clipping thresholds for 
each speech sample are defined with respect to the individual speech level 
for that sample, specified in equivalent peak level (epl). Speech clipping 
ts also treated as speech-correlated noise by assuming that it 1s caused not 
by a voltage limiting process, as actually occurs, but by an additive ‘‘phantom 
signal’’ that will cause the original signal to appear to be clipped. Empirical 
measures are obtained for the percent time a clipping level is exceeded, 
for the relation between phantom signal (1.e., noise) power and clipping 
level, and for the loss in signal power resulting from clipping. The relation 
is also established between epl and average (rms) power to allow signal 
power and signal-to-clipping notse levels to be specified. 


I. INTRODUCTION 


This study reports data on the statistics of instantaneous speech 
levels in continuous speech samples, with special emphasis on threshold 
crossings and other quantities related to peak clipping. Since a standard 
measure of specifying the amount of peak clipping is lacking, the study 
begins by arbitrarily defining the clipping level in terms of an objective 
speech level measure, the equivalent peak level (epl).* Then, measures 
of time spent above various thresholds (or clipping levels) are reported. 
Finally, a method is suggested for interpreting clipping as speech- 
correlated noise. Empirical measures are given for the amount of noise 
introduced as a function of the amount of clipping, and signal power 
lost because of clipping. 

This paper does not address itself to subjective measurements related 
to clipping of the speech signal. The subjective effects of clipping must 
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eventually be considered in a treatment of the broad problem of deter- 
mining clipping performance objectives. It is also necessary to examine 
the objective effects of clipping on the transmitted signal so that 
guidelines can be established for design and operation of transmission 
systems. This paper is directed toward providing data to assist in 
engineering considerations of speech circuits. 


II. DEFINITION OF A MEASURE OF PEAK CLIPPING 


Peak clipping, produced by an abrupt limiting of a waveform when 
its amplitude attempts to exceed a clipping voltage, is possible in 
virtually any speech transmission system if the speech level becomes 
too high. It is quite easy to specify the limiting voltage in an absolute 
sense (e.g., +0.5 V), but the speech impairments caused by clipping are 
a function of the relation between the clipping voltage and the speech 
level. 

One clipping measure sometimes used is the difference between the 
clipping level and the volume unit (VU). This difference is quite variable 
because of the variability in the VU measure, as determined in experi- 
ments by Shearme and Richards,” and Brady.* A study related to VU 
and speech peaks was done by Noll,* who measured the difference 
between the highest instantaneous peak of a speech sample and the 
VU for possible application to peak clipping. This difference was also 
quite variable among samples, causing Noll to conclude that ‘volume 
(i.e, VU) distributions cannot easily be converted to peak distribu- 
tions.”’ He states further that the variability “‘might have been caused 
by inaccuracies inherent in reading a VU meter.” 

Another clipping measure sometimes used is the difference between 
the highest instantaneous peak in the speech sample and the clipping 
voltage. Although this measure is objective and fairly easily obtained, 
it is too dependent on the near chance value of the highest peak. Should 
this peak be due to an unusually loud segment of speech or even a click 
or spurious signal, the clipping measure has little relationship to the 
major part of the speech sample. 

The present study begins by defining a new measure of clipping as 
the difference between the epl of a speech sample and the clipping level. 
This difference is measured in dB. It promises to be more stable than 
VU-related clipping measures since the epl is an objective measure 
with considerably less variability than the VU.’ It is emphasized 
that this new clipping measure is not proposed here as a new standard; 
it is simply adopted here as a more stable basis than previous measures 
so that subsequent derived measures will be more precisely defined. 
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There are three principal derived measures discussed here. One 
measure is a2 count of the percent of time that the clipping level is 
exceeded. For example, if one clips 3 dB below the epl, what percent of 
the waveform would be affected? The second is the measure of power 
in a “phantom” interference signal. The clipping is assumed to be caused 
not by a voltage limiting process, but rather by a second interference 
signal added to the first. By estimating power in the phantom signal, a 
signal/noise ratio can be defined for different clipping levels. The third 
measure is power lost due to clipping. 


III. USE OF CONTINUOUS SPEECH 


Many objective measures of speech are strong functions of the 
“activity factor” or the percent of time a person is talking. For example, 
a person speaking at a fixed level can change his average power by 
varying his activity factor. In the present study, the quantities to be 
measured are dependent on the time base chosen. 

In studies on clipping, there is little interest in what happens when 
speech is not present. It makes little sense to examine long silent 
intervals, which would occur during a telephone conversation. Therefore, 
this study’s measures are restricted to only those times when speech is 
present. 

The author knows of no speech detecting technique that will define 
intervals of speech activity in a manner that is insensitive to arbitrary 
choices of parameters such as detector threshold setting. This is shown, 
for example, in two previous studies in which substantial variation in 
detected speech patterns occurred with fairly small changes in detector 
parameters.”’® In the present study, the detection process is bypassed 
by using ‘continuous speech” material that contains a negligible 
number of perceived gaps. 

The method of recording continuous speech is documented in Ref. 6. 
These recordings were prepared from recordings of experimental tele- 
phone conversations by manually splicing together sections containing 
a continuous flow of speech. There were eight male and eight female 
speakers each providing two separate recordings, except for two women 
who each made only one recording. In all, there were 16 male and 14 
female continuous speech samples, the average length of each sample 
being about 55 seconds. 

Measures made on these samples will be considered valid measures 
made “during speech.” For example, if a threshold is exceeded here 
3 percent of the time, this will imply that in noncontinuous ‘‘real’’ 
speech, the threshold is exceeded 3 percent ‘‘during speech” or ‘‘while the 
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speaker is talking.” The continuous speech tapes were edited by the 
author using a highly empirical procedure and personal judgment. There- 
fore, all measures reported here are dependent on the author’s personal 
judgment, to the extent that the selection of continuous speech material 
is dependent on this judgment. 

The 30 samples of continuous speech provide a basis for studying 
the variability of the desired measures over different speech samples. 
Note again that only 16 speakers are represented in the 30 samples, 
so that measures of variability are somewhat confounded if they are 
to be applied to variability over different speakers. This confounding 
is probably slight, since two samples from the same speaker are sometimes 
quite different in measures such as speech level and dominance of 
conversation (whether the person is basically listening or talking). 

The speech samples of all the men were played back at a level high 
enough to use nearly the full range of a 12-bit A-to-D converter. Once 
the level adjustment for males was set, it was not changed from sample 
to sample. The level adjustment was then reset for the women, and 
all female speech samples were played at the new adjustment. 


IV. RELATION OF EPL TO PEAKS 


The epl reads the peak of an assumed log uniform distribution of 
instantaneous voltage, whose rms voltage above threshold is the same 
as that measured in the sample.* If speech were distributed exactly 
according to the log uniform distribution, then the epl would be the 
highest level or the ‘peak.’ In practice, the speech distribution does not 
abruptly and neatly terminate at some fixed level. Occasionally there 
are voltages exceeding the range of levels of the speech. The expected 
epl would be a “nominal” upper voltage limit, which would be exceeded 
occasionally during loud speech passages. 

To get a quantitative measure of the relation between the epl and 
the highest instantaneous peaks, all 30 continuous speech samples were 
played into an A-to-D converter connected to a PDP-8 computer. 
The computer stored a histogram count of the measured voltages 
sampled at 1 kHz. After calculating the epl at the end of each speech 
sample, the number of A-to-D readings that exceeded the epl was 
counted. . 

On the average, the epl was exceeded 2.0 percent of the time (¢ = 
0.29 percent) for female speech samples and 2.9 percent of the time 


* The epl as defined in Ref. 1 has been revised as follows. Step 5a should read, 
‘Gf D < 6.75, set A = (D — 2,75)/0.4.” 


PEAK CLIPPING AND THRESHOLD CROSSINGS 937 


(c = 0.46 percent) for male speech samples. The average of all 30 
samples was 2.5 percent. In general, in noncontinuous or “standard” 
speech, the epl would be exceeded about 2.5 percent of the ‘time that 
speech is present.” 

The speech voltages that did exceed the epl occasionally reached 
levels of 7 or even 8 dB above the epl, but such events were very rare. 
The epl + 6 dB level was exceeded about one-tenth as often as the epl. 
That is, the epl + 6 dB was typically exceeded between 0.2 and 0.3 
percent of the time. Thus, the ep] + 6 dB appears to be a practical 
upper limit of the instantaneous speech levels. 


V. PERCENT TIME CLIPPING LEVEL IS EXCEEDED 


The log-uniform speech distribution model predicts that as a thresh- 
old (i.e., clipping level) is lowered, equal dB decreases in the threshold 
produce equal increments in percent time that speech exceeds the 
threshold.’ This model is somewhat inaccurate in the immediate region 
of the epl, but it has been found to hold up over a wide range (80 or 
40 dB) below the epl of most speech samples tested.’ Eventually, for 
very low thresholds, the model must break down because one could 
continue lowering the threshold in equal dB increments to minus 
infinity; and one certainly cannot indefinitely add equal percent time 
increments. 

Two experiments with continuous speech yielded data relevant to 
percent time above clipping level. The first showed that the epl itself 
is cleared about 2.5 percent of the time. This is one point on the ‘“‘percent 
time vs clipping level’’ curve. 

The second experiment sought to establish another point on the 
curve by obtaining a scatter plot of percent time over a range of thresh- 
olds for different speech samples. In an earlier study, percent time over 
a fixed threshold of —25 dBm was measured for 30 continuous speech 
samples. However, the samples were played at their original recorded 
levels rather than at levels equalized for A-to-D converter range. The 
epls had a sigma of 3.6 dB, which was a larger range than those obtained 
for the samples with equalized levels. If the epls had a fairly large 
range, it follows that epl minus threshold also had the same large range 
for a fixed threshold. 

The scatter plot of percent time over threshold vs threshold (re 
epl) is shown in Fig. 1. The mean value of the 30 points occurs at the 
coordinates 15 dB for epl minus threshold, and 30 percent for time 
above threshold. Perhaps the simplest way to provide a linear fit to 
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DIFFERENCE BETWEEN EPL AND THRESHOLD IN DECIBELS 


Fig. 1—Percent time threshold is cleared vs difference between ep] and threshold 
for continuous speech. 


the Fig. 1 data is to draw a straight line between the mean values for 
the two experiments. 

Each of these two points determining the linear fit is an average 
of 30 samples. An alternative means of fitting a line to the data could 
be to use all 60 points in a least-mean-square fit. This might yield a 
more ‘‘accurate’”’ fit to these data, but it must be remembered that the 
vertical axis, the ‘percent time” measure, is based upon an empirical 
definition of continuous speech. The straight line connecting the means 
of the two data sets is probably just as reasonable a fit if the curve is 
to be regarded as an approximate guide for application to speech circuit 
design. Note that in Fig. 1, the line passes within 5 percent divisions of 
most of the points, tending to justify the linear fit predicted by the 
idealized log-uniform speech distribution model. Previous work suggests 
that the linearity will be maintained until 30 or 35 dB below the epl.’ 

The line has a slope of 1.83 percent increase in time for every 1 dB 
drop in threshold, or approximately 9 percent for every 5 dB drop. 
This result can be combined with the conclusion of Section IV as a 
guideline for threshold clearance percentages: 

Instantaneous voltages rarely occur at a level higher than 6 dB 
above the epl. During continuous speech, the epl is exceeded roughly 
2.5 percent of the time. A threshold is cleared an additional 9 percent 
of the time for every 5 dB drop in threshold below the epl. 
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VI. RELATION OF EPL TO AVERAGE POWER 


While obtaining the epls of the 30 speech samples, measures were 
also obtained of the long-term rms power in the continuous speech. 
The average epl minus rms for the 30 samples was 10.0 dB, with a 
standard deviation of 0.8 dB. 

Therefore, during continuous speech, the speech power is about 10 dB 
below the epl. This result will be useful in the next section in relating 
speech power to clipping signal power. 


VII. POWER IN A PHANTOM CLIPPING WAVEFORM 


Instead of thinking of peak clipping as a limiting process, imagine 
it to be caused by the addition of a second phantom signal that will 
cause the original waveform to be restrained at some voltage. This 
process is illustrated in Fig. 2, showing the original signal, the phantom 
signal, and the result of adding them (i.e., the clipped signal). 

The phantom signal can be considered as speech-correlated noise. If 
its power is known, then a signal/noise (S/N) ratio for clipped speech 
can be determined. In applying the S/N ratio, be careful to remember 
that the noise is speech-correlated, and hence may produce very different 


ORIGINAL 
WAVEFORM 





PHANTOM 
SIGNAL 


SUM OF 
ORIGINAL + PHANTOM 
=CLIPPED SIGNAL 


Fig. 2—Illustration showing how clipping can be thought of as due to an additive 
phantom signal. 
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effects on intelligibility, detectability, etc., than would a steady back- 
ground noise with the same S/N ratio. 

Note that this ‘‘power’’ of the phantom signal is a fictitious quantity, 
and no such power is added to the signal, even though a voltage is added. 
In fact, clipping produces a net power loss since it removes part of the 
signal. 

The phantom signal power was calculated by playing a speech sample 
into the PDP-8 computer, and recording a histogram of the absolute 
magnitudes of all voltage levels obtained. When finished, the computer 
chose a fixed threshold level, 7’, and for each voltage level above this 
computed (Voltage — Threshold)”. An accumulator was updated with 
this quantity n,; times, where nz was the number of counts for that 
threshold. 

The computer then divided the total }> (V — T)’ by the total number 
of A-to-D samples for the entire speech sample. The time base for aver- 
aging was the total length of the continuous speech sample, not the 
number of voltages exceeding 7. Thus, the power for the phantom 
signal (the V — T voltage waveform) was an average ‘during the time 
speech is present.’”’ The lower the threshold, the greater the power in the 
(V — T) phantom noise signal would be, since (V — 7) was greater 
for each voltage and more voltages exceeded T. 

The above process was done for six thresholds: —5, —10, --- , 
—30 dBm. For each speech sample a curve of phantom power vs 
clipping voltage was produced. These curves for all 30 speech samples 
are plotted in Fig. 3. Each speech sample curve has been normalized 
to its epl. 

The author knows of no speech distribution model that would be 
appropriate for predicting the curves of Fig. 3. The log-uniform model, 
which works well for relating power to epl, is a poor fit in the voltage 
region close to the epl, since it incorrectly predicts an abrupt upper 
limit to the speech voltages at the epl. Lacking a model, the curves of 
Fig. 3 were treated as a scatter plot of independent points (each curve 
produced 5 or 6 such points) which were fitted with a third-degree 
polynomial. The curve fitting program, supplied by E. A. Youngs of 
Bell Laboratories, yielded the curve described by the equation below. 
In this equation, P = power in the phantom signal in dB (re epl) and 
T = clipping threshold in dB (re epl). (E.g., for clipping level 1 dB 
below the epl, 7 = —1.) 

For all speakers, 


P = —21.86 —1.193T —0.04437” —0.00057T”. 
This curve is plotted in Fig. 4. 
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CLIPPING POWER IN DECIBELS BELOW EPL. 
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Fig. 3—Clipping power in a phantom clipping signal as a function of clipping 
level, for 30 continuous speech samples. Each curve is normalized to the epl for the 
associated speech sample. 


VIII. POWER LOST FROM PEAK CLIPPING 


If a signal is clipped, the resulting signal has less power than the 
original signal. The power loss in dB was calculated for all 30 continuous 
speech samples using six clipping levels for each sample. The 30 curves 
of power loss vs clipping level are superimposed in Fig. 5. The poly- 
nomial fit to these data, shown in Fig. 6, is given by 


P = —1.084 + 0.3737 — 0.0127”, 


where P = power loss (if 5 dB is lost, P = —5) and T = clipping 
threshold (if 5 dB below epl, 7 = —5). (A second-order polynomial was 
sufficient to meet the “fit criterion” of the polynomial curve fitting 
program.) 

Also plotted in Fig. 6 is a curve of power loss vs clipping obtained 
by Wathen-Dunn and Lipke,’ after an empirical instantaneous speech 
level probability distribution function developed by Davenport.® 
Wathen-Dunn defines his zero dB clipping level reference point as that 
point exceeded by only 0.1 percent of Davenport’s speech signal, which 
consisted of having speakers read aloud from a book. Although such 
speech must have contained perceptible pauses, it is closer to ‘‘con- 
tinuous speech” than a telephone conversation. To compare Wathen- 
Dunn’s work with this, let us consider his 0.1 percent peak point for book 
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Fig. 4—Third-order polynomial fit to data of Fig. 3. 
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5—Power loss vs clipping level for 30 samples of continuous speech. 


Fig. 
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_-WATHEN—DUNN AND LIPKE, 1958, 


FROM DAVENPORT'S DATA,1952 


CLIPPED SPEECH RMS POWER IN DECIBELS 
BELOW UNCLIPPED SPEECH RMS POWER 
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Fig. 6—Solid curve is a polynomial fit to data of Fig. 5. Dashed curve 1s taken 
from Wathen-Dunn and Lipke.® 


reading as equivalent to the epl + 6 dB “‘peak”’ criterion established here 
for continuous speech. Fig. 6 of the present report is Wathen-Dunn’s 
Fig. 6 with the horizontal axis shifted 6 dB; that is, what he calls 9 dB 
of clipping is plotted here as 3 dB below the epl. 

In this study, the calculation of power loss is very close to Wathen- 
Dunn’s and is somewhat a repeat of his work, except that the present 
study is a direct empirical calculation of the loss, rather than a theoretical 
calculation based on an empirical probability function. 


IX. DIFFERENCES BETWEEN MALE AND FEMALE CONTINUOUS SPEECH 


Some of the results of this study have been reported separately for 
male and female speech because there appears to be a real difference in 
the speech activity factor in the two sets of speech samples. 

The average epls for men and women were adjusted to be roughly 
equal. The final average epls were measured as —5.86 dBm for women 
and —7.18 dBm for men. The women’s levels therefore averaged 
1.3 dB higher than the men’s. However, measurements of average power 
(true rms) show that women averaged —16.46 dBm, and men averaged 
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—16.02 dBm. While the women’s speech samples had higher levels, 
they had a lower average power than the men’s. 

This can be explained if during the editing process to produce con- 
tinuous speech, the author allowed more short gaps in the women’s 
speech than the men’s, resulting in a lower activity factor for women. 
This is borne out by measurements of percent of A-to-D samples crossing 
a fixed —25 dBm threshold, which show an average of 35.9 percent 
for women, and 48.3 percent for men. This difference would cause the 
higher level women’s speech to have less average power. 

Once again, we are back to the problem of determining ‘‘when speech 
is present.’’ Even though all editing was done in a two month period by 
the author, and even though the same telephone circuit, recording 
equipment, and editing equipment was used for both sets of data, 
consistent differences exist between the data for men and women. This 
finding should be a further warning that the results found here are 
intended only as guidelines and not as rigid specifications, since they 
can be influenced by the arbitrary design of the speech detection process. 


X. RELATION OF PEAK CLIPPING TO VU 


In an earlier unpublished study, the author suggested that VUs might 
be obtained from epls by subtracting 11 dB from the epl. This was based 
on extensive data from one highly trained observer, plus sample data 
from two other observers. 

In the present study, the epl + 6 dB is suggested as a practical upper 
bound to the speech waveform. Thus, using the 11 dB epl to VU con- 
version, the expected VU would be about 17 dB below the highest peaks. 
This result is remarkably close to Noll’s estimate of 17.2 dB as the 
peak-VU factor.* 

It would appear that the data presented here might be applied to 
VU data by using the 11 dB conversion for epl to VU. A large body of 
VU data exists from field trials and the present results might be applied 
to these data. The author has also shown that VU data taken from 
many observers can be extremely erratic and at times can seem almost 
random.” Because of the variability in VU measurements, care should 
be taken in combining the results of the present study with existing 
VU data. 


XI. SUMMARY 


This study sought to examine speech signal statistics relevant to peak 
clipping and threshold crossings. All measurements were made on 
continuous speech. The following results were obtained: 
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(i) Instantaneous voltages rarely occur at levels higher than the 
epl + 6 dB. 
(ii) The epl is exceeded by the instantaneous waveform about 2.5 
percent of the time. 
(iii) An additional 9 percent of time above threshold is gained for 
every 5 dB drop in threshold below the epl. 
(iv) The average power in continuous speech is about 10 dB below 
the epl. 
(v) The power in a “phantom noise signal’’, representing speech- 
correlated clipping noise, can be approximated by the curve in 
Fig. 4. 
(vi) The power loss in continuous speech resulting from clipping can 
be approximated by the curve in Fig. 6. 
(vii) VUs can be roughly approximated by subtracting 11 dB from 
epls. The other results might then be appropriately modified 
for VU data. 
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B. 8. T. J. BRIEF 


Proof of a Convexity Property 
of the Erlang B Formula 


By E. J. MESSERLI 
(Manuscript received January 5, 1972) 


I. INTRODUCTION 


Consider Poisson traffic offered to a group of n trunks. Blocked calls 
are cleared, and call holding times are independent and identically 
distributed. In equilibrium, the call congestion is given by the well- 
known Erlang B formula’ 


n! 
n a’ 


tao Kl 





B(n, a) = 


where a is the offered load in erlangs. We prove here that 
B(n, a) — B(n +1, a) < Bin — 1, a) — Bln, a) 
m=1,2,--+, (1) 


ie., strict convexity with respect to the number of trunks.‘ For a trunk 
group with sequential assignment of offered calls, these inequalities 
simply state that the load carried on the last trunk is monotonically 
decreasing with the number of trunks—a commonly accepted fact which 
is basic to economic alternate routing in network engineering.” Never- 
theless, analytical verification of (1) has apparently not been published. 
The proof given here offers one approach to verifying convexity prop- 
erties for other loss systems as well. 


II, DEVELOPMENT OF MAIN RESULT 


For an m-trunk group with Poisson offered traffic, consider a class 
of operating policies of the form: 


t Syski! mentions some results for the analytic continuation of B(-, a), but con- 
vexity on the integers does not seem to follow. 
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If a call arrives when there are 7 calls in progress, 7 < m, accept 
the call with probability 5; , where 6; is fixed. 


A policy may be represented by a vector 6(m) = (65), 61, °°* 5 Sm—i) 
For any policy 6(m), identifying states with the number of calls in 
progress, a stationary solution to the birth-and-death equations exists’ 
with carried load given by: 


m k m k -1 
c(8(m)) = ka do *** mal + el dp °° bt) : 


Lemma 1: The carried load c(é(m)) 1s maximized by the unique optimal 
policy 1(m) defined by 6; = 1,7 = 0,1, --- ,m — 1, @e., af there ts an 
empty trunk, accept the call. 


This result is obvious and a proof is straightforward: if 7 is the first 
index such that 5; < 1 for 6(m), it is easy to show that 


55 ca(m)) > 0. 


The result is also implicit in the fundamental inequalities developed by 
BeneS.’ Several proofs for the lemma were later given in Ref. 4. 

For a group of n trunks, with sequential assignment, let the carried 
load on the 7th trunk be given by a; = a;(a),a > 0. Since a, + a,+-::- 
+a, = a(l — B(n, a)), the inequalities (1) hold if and only if 

a > a2 > a3 °°° 
The main result is: 


Theorem 1: The sequence a; , G2, °°: , wherea; = a(B(i — 1, a) — Bi, a)), 
satisfies a, > G2 > -:- for any positive offered load a. 


Proof: Suppose not, i.e., that for some n and a > 0, a,-1 S a,. We 
shall show that this leads to a contradiction of Lemma 1. Thus, consider 
a group of n — 1 trunks with calls placed according to: 


(t) Sequential assignment for the first n — 2 trunks. 

(17) Overflow calls from the first n — 2 trunks are offered to the last 
trunk according to the status of a dummy trunk. If the dummy 
trunk is free, the call is rejected, and a dummy call with the 
same holding time distribution is placed on the dummy trunk. 
If the dummy trunk is busy, the call is offered to the last trunk. 


For this system, define 
P; = Prob (call put up | ¢ real calls in progress). 
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It is easy to see that P; = 1,7 = 0,1, --- , n — 3, and that P,_, < 1. 
Moreover, the carried load is equal to that for an n — 1 trunk group 
system corresponding to the policy 6*(n — 1) = (P,, -:: , Py-2). But 
the carried load c(é*(n — 1)) is given by a, + a2 +--+ + a,-2 + 4,. 
Since a,-1 S @,, 


c(6*(n — 1)) 2a, + ae + +++ + ay. = c(1(n — 1)) 


which contradicts Lemma 1. This completes the proof. 

Remark: Subsequent to the appearance of this development in un- 
published form, other proofs have been given. Krupp’ gives an algebraic 
proof. Descloux® gives both an algebraic proof, and a proof of the 
equivalent result to (1) for renewal input. Buchner and Neal’ also give 
a proof for the generalization to renewal input. A reviewer pointed out 
that the result can be proved by comparing occupancy probabilities 
on the nth trunk for sequential assignment, and for sequential assign- 
ment modified so that an overflow call from the first n ~ 2 trunks 
chooses one of the last two trunks equally likely if both are free. 
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